* Re: [linux-lvm] thin handling of available space
[not found] <1684768750.3193600.1461851163510.JavaMail.yahoo.ref@mail.yahoo.com>
@ 2016-04-28 13:46 ` matthew patton
0 siblings, 0 replies; 29+ messages in thread
From: matthew patton @ 2016-04-28 13:46 UTC (permalink / raw)
To: LVM general discussion and development
> > The real question you should be asking is if it increases the monitoring
> > aspect (enhances it) if thin pool data is seen through the lens of the
> > filesystems as well.
Then you want an integrated block+fs implementation. See BTRFS and ZFS. WAFL and friends.
> kernel for communication from lower fs layers to higher layers -
Correct, because doing so violates a fundamental precept of OS design: higher layers trust lower layers. A thin pool is outright lying about the real world to anything that uses its services. That is its purpose. The FS doesn't give a damn that the block layer is lying to it; it can and does assume, rightly so, that what the block layer says it has, it indeed does have. The onus of keeping the block layer ahead of the FS falls on a third party - the system admin. The system admin decided it was a bright idea to use thin pools in the first place, so he necessarily signed up for the hazards and risks that choice entails. It is not the job of the FS to bail his ass out.
A responsible sysadmin who chose to use thin pools might configure the initial FS size to be some modest size well within the constraints of the actual block store, and then, as the FS hit say 85% utilization, run a script that investigates the state of the block layer and uses resize2fs and friends to grow the FS, letting the thin-pool likewise grow to fit as IO gets issued. But at some point, when the competing demands of other FSes on the thin-pool were set to breach actual block availability, the script would refuse to grow the FS, and thus userland would get signaled by the FS layer that it's out of space when it hit 100% utilization.
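A minimal sketch of that watchdog might look as follows. Hedged: vg0/thinpool, vg0/data, and /mnt/data are hypothetical names, the 85%/70% thresholds are illustrative, and the commented-out wiring against lvs/resize2fs is untested.

```shell
#!/bin/sh
# Grow the FS only while the thin pool still has real headroom.
# Policy: FS nearly full (>= 85%) AND pool comfortably below its
# danger zone (<= 70%) before we hand the FS more apparent space.
safe_to_grow() {
    fs_used_pct=$1     # e.g. parsed from: df --output=pcent /mnt/data
    pool_used_pct=$2   # e.g. parsed from: lvs --noheadings -o data_percent vg0/thinpool
    [ "$fs_used_pct" -ge 85 ] && [ "$pool_used_pct" -le 70 ]
}

# Example wiring (commented out -- needs root and your real VG/LV names):
# fs=$(df --output=pcent /mnt/data | tail -1 | tr -dc '0-9')
# pool=$(lvs --noheadings -o data_percent vg0/thinpool | cut -d. -f1 | tr -dc '0-9')
# if safe_to_grow "$fs" "$pool"; then
#     lvextend -L +10G vg0/data && resize2fs /dev/vg0/data
# fi
```

When the pool side of the test fails, the script simply declines to grow, and userland eventually sees ENOSPC from the FS at 100% utilization - which is exactly the signaling described above.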
Another way (haven't tested) to 'signal' the FS as to the true state of the underlying storage is to have a sparse file that gets shrunk over time.
But either way if you have a sudden burst of I/O from competing interests in the thin-pool, what appeared to be a safe growth allocation at one instant of time is not likely to be true when actual writes try to get fulfilled.
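As an aside on the pool-growth half of this race: LVM can also auto-extend the pool itself via dmeventd, configured in lvm.conf. The values below are illustrative (check the defaults shipped with your version); auto-extension softens the race but cannot conjure space the VG doesn't have.

```
# /etc/lvm/lvm.conf (excerpt): when pool usage crosses 80%, dmeventd
# extends the pool by 20% of its current size, while VG space lasts.
activation {
    thin_pool_autoextend_threshold = 80
    thin_pool_autoextend_percent = 20
}
```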
Mindless use of thin-pools is akin to crossing a heavily mined beach. Bring a long stick and say your prayers, because you're likely going to lose a limb.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [linux-lvm] thin handling of available space
2016-05-04 1:25 ` Mark Mielke
@ 2016-05-04 18:16 ` Xen
0 siblings, 0 replies; 29+ messages in thread
From: Xen @ 2016-05-04 18:16 UTC (permalink / raw)
To: LVM general discussion and development
Mark Mielke wrote on 04-05-2016 3:25:
> Thanks for entertaining this discussion, Matthew and Zdenek. I realize
> this is an open source project, with passionate and smart people,
> whose time is precious. I don't feel I have the capability of really
> contributing code changes at this time, and I'm satisfied that the
> ideas are being considered even if they ultimately don't get adopted.
> Even the mandatory warning about snapshots exceeding the volume group
> size is something I can continue to deal with using scripting and
> filtering. I mostly want to make sure that my perspective is known and
> understood.
You know, you really don't need to be this apologetic even if I mess up
my own replies ;-).
I think you have a right and a reason to say what you've said, and
that's it.
* Re: [linux-lvm] thin handling of available space
[not found] <799090122.6079306.1462373733693.JavaMail.yahoo.ref@mail.yahoo.com>
@ 2016-05-04 14:55 ` matthew patton
0 siblings, 0 replies; 29+ messages in thread
From: matthew patton @ 2016-05-04 14:55 UTC (permalink / raw)
To: LVM general discussion and development
On Tue, 5/3/16, Mark Mielke <mark.mielke@gmail.com> wrote:
> I get a bit lost here in the push towards BTRFS and ZFS for people with these expectations as
> I see BTRFS and ZFS as having a similar problem. They can both still fill up.
Well of course everything fills up eventually. BTRFS and ZFS are integrated systems where the FS can see into the block layer and "do" block layer activities vs the clear demarcation between XFS/EXT and LVM/MD.
If you write too much to a Thin FS today you get serious data loss. Oh sure, the metadata might have landed but the file contents sure didn't. Somebody (you?) mentioned how you seemingly were able to write 4x90GB to a 300GB block device and the FS fsck'd successfully. This doesn't happen in BTRFS/ZFS and friends. At 300.001GB you would have gotten a write error and the write operation would not have succeeded.
* Re: [linux-lvm] thin handling of available space
2016-05-03 12:00 ` matthew patton
2016-05-03 14:38 ` Xen
@ 2016-05-04 1:25 ` Mark Mielke
2016-05-04 18:16 ` Xen
1 sibling, 1 reply; 29+ messages in thread
From: Mark Mielke @ 2016-05-04 1:25 UTC (permalink / raw)
To: matthew patton, LVM general discussion and development
On Tue, May 3, 2016 at 8:00 AM, matthew patton <pattonme@yahoo.com> wrote:
> > written as required. If the file system has particular areas
> > of importance that need to be writable to prevent file
> > system failure, perhaps the file system should have a way of
> > communicating this to the volume layer. The naive approach
> > here might be to preallocate these critical blocks before
> > proceeding with any updates to these blocks, such that the
> > failure situations can all be "safe" situations,
> > where ENOSPC can be returned without a danger of the file
> > system locking up or going read-only.
>
> why all of a sudden does each and every FS have to have this added code to
> second guess the block layer? The quickest solution is to mount the FS in
> sync mode. Go ahead and pay the performance piper. It's still not likely to
> be bullet proof but it's a sure step closer.
>
Not all of a sudden. From an "at work" perspective, LVM thinp as a technology
is relatively recent, and only recently being deployed in more places as we
migrate our systems from RHEL 5 to RHEL 6 to RHEL 7. I didn't consider
thinp an option before RHEL 7, and I didn't consider it stable even in RHEL
7 without significant testing on our part.
From an "at home" perspective, I have been using LVM thinp from the day it
was available in a Fedora release. The previous snapshot model was
unusable, and I wished upon a star that a better technology would arrive. I
tried BTRFS and while it did work - it was still marked as experimental, it
did not have the exact same behaviour as EXT4 or XFS from an applications
perspective, and I did encounter some early issues with subvolumes.
Frankly... I was happy to have LVM thinp, and glad that you LVM developers
provided it when you did. It is excellent technology from my perspective.
But, "at home", I was willing to accept some loose edge case behaviour. I
know when I use storage on my server at home, and if it fails, I can accept
the consequences for myself.
"At work", the situation is different. These are critical systems that I am
betting LVM on. As we begin to use it more broadly (after over a year of
success in hosting our JIRA + Confluence instances on local flash, using LVM
thinp for much of the application data, including PostgreSQL databases), I
am very comfortable with it from a "< 80% capacity" perspective. However,
every so often it passes 80%, and I have to raise the alarm, because I know
that there are edge cases that LVM / DM thinp + XFS don't handle quite so
well. It's never happened in production yet, but I've seen it happen many
times on designer desktops when they are using LVM: they lock up their
system and require a reboot to recover.
I know there are smart people working on Linux, and smart people working on
LVM. Given the opportunity, and the perspective, I think the worst of these
cases are problems that deserve to be addressed, and probably ones that
people have been working on with or without my contributions to the subject.
> What you're saying is that when mounting a block device the layer needs to
> expose a "thin-mode" attribute (or the sysadmin sets such a flag via
> tune2fs). Something analogous to how mke2fs can "detect" LVM raid mode
> geometry (does that actually work reliably?).
>
> Then there has to be code in every FS block de-stage path:
>
>     IF thin {
>         tickle block layer to allocate the block (aka write zeros to it?
>         - what about pre-existing data, is there a "fake write" BIO call
>         that does everything but actually write data to a block, but
>         would otherwise trigger LVM thin's extent allocation logic?)
>         IF success, destage dirty block to block layer
>         ELSE inform userland of ENOSPC
>     }
>
> In a fully journal'd FS (metadata AND data) the journal could be 'pinned'
> and likewise the main metadata areas if for no other reason they are zero'd
> at onset and or constantly being written to. Once written to, LVM thin
> isn't going to go back and yank away an allocated extent.
>
Yes. This is exactly the type of solution I was thinking of, including
pinning the journal! You used the correct terminology. I can read the terms
but not write them. :-)
You also managed to summarize it in only a few lines of text. As concepts
go, I think that makes it not-too-complex.
But, the devil is often in the details, and you are right that this is a
per-file system cost.
Balancing this, however, I am perhaps presuming that *all* systems will
eventually be thin volume systems, and that correct behaviour and highly
available behaviour will eventually require that *all* systems invest in
technology such as this. My view of the future is that fixed sized thick
partitions are very often a solution which is compromised from the start.
Most systems of significance grow over time, and the pressure to reduce
cost is real. I think we are taking baby steps to start, but that the
systems of the future will be thin volume systems. I see this as a problem
that needs to be understood and solved, except in the most limited of use
cases. This is my opinion, which I don't expect anybody to share.
> This at least should maintain FS integrity albeit you may end up in a
> situation where the journal can never get properly de-staged, so you're
> stuck on any further writes and need to force RO.
>
Interesting to consider. I don't see this as necessarily a problem - or
that it necessitates "RO" as a persistent state. For example, it would be
most practical if sufficient room was reserved to allow for content to be
removed, allowing for the file system to become unwedged and become "RW"
again. Perhaps there is always an edge case that would necessitate a
persistent "RO" state that requires the volume be extended to recover from,
but I think the edge case could be refined to something that will tend to
never happen?
> > just want a sanely behaving LVM + XFS...)
> IMO if the system admin made a conscious decision to use thin AND
> overprovision (thin by itself is not dangerous), it's up to HIM to actively
> manage his block layer. Even on million dollar SANs the expectation is that
> the engineer will do his job and not drop the mic and walk away. Maybe the
> "easiest" implementation would be a MD layer job that the admin can tailor
> to fail all allocation requests once extent count drops below a number and
> thus forcing all FS mounted on the thinpool to go into RO mode.
>
Another interesting idea. I like the idea of automatically shutting down
our applications or PostgreSQL database if the thin pool reaches an unsafe
allocation, such as 90% or 95%. This would ensure the integrity of the
data, at the expense of an outage. This is something we could implement
today. Thanks.
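A cron-able sketch of that shutdown idea (hedged: the 90% trip point, vg0/thinpool, /mnt/data, and the postgresql service name are assumptions, and the commented-out wiring is untested):

```shell
#!/bin/sh
# Stop write-heavy services once the thin pool passes a hard threshold,
# trading a planned outage for data integrity.
pool_unsafe() {
    pool_used_pct=$1    # integer part of lvs' data_percent
    threshold=${2:-90}  # default trip point
    [ "$pool_used_pct" -ge "$threshold" ]
}

# Example wiring (commented out -- needs root and your real names):
# used=$(lvs --noheadings -o data_percent vg0/thinpool | cut -d. -f1 | tr -dc '0-9')
# if pool_unsafe "$used" 90; then
#     systemctl stop postgresql          # take the outage...
#     mount -o remount,ro /mnt/data      # ...and pin the data read-only
# fi
```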
> But in any event it won't prevent irate users from demanding why the space
> they appear to have isn't actually there.
Users will always be irate. :-) I mostly don't consider that as a real
factor in my technical decisions... :-)
Thanks for entertaining this discussion, Matthew and Zdenek. I realize this
is an open source project, with passionate and smart people, whose time is
precious. I don't feel I have the capability of really contributing code
changes at this time, and I'm satisfied that the ideas are being considered
even if they ultimately don't get adopted. Even the mandatory warning about
snapshots exceeding the volume group size is something I can continue to
deal with using scripting and filtering. I mostly want to make sure that my
perspective is known and understood.
--
Mark Mielke <mark.mielke@gmail.com>
* Re: [linux-lvm] thin handling of available space
2016-05-03 13:01 ` matthew patton
2016-05-03 15:47 ` Xen
@ 2016-05-04 0:56 ` Mark Mielke
1 sibling, 0 replies; 29+ messages in thread
From: Mark Mielke @ 2016-05-04 0:56 UTC (permalink / raw)
To: matthew patton, LVM general discussion and development
On Tue, May 3, 2016 at 9:01 AM, matthew patton <pattonme@yahoo.com> wrote:
> On Mon, 5/2/16, Mark Mielke <mark.mielke@gmail.com> wrote:
> <quote>
> very small use case in reality. I think large service
> providers would use Ceph or EMC or NetApp, or some such
> technology to provision large amounts of storage per
> customer, and LVM would be used more at the level of a
> single customer, or a single machine.
> </quote>
>
> Ceph?!? yeah I don't think so.
>
I don't use Ceph myself. I only listed it as it may be more familiar to
others, and because I was responding to a Red Hat engineer. We use NetApp
and EMC for the most part.
> If you thin-provision an EMC/Netapp volume and the block device runs out
> of blocks (aka Raid Group is full) all volumes on it will drop OFFLINE.
> They don't even go RO. Poof, they disappear. Why? Because there is no
> guarantee that every NFS client, every iSCSI client, every FC client is
> going to do the right thing. The only reliable means of telling everyone
> "shit just broke" is for the asset to disappear.
>
I think you are correct. Based upon experience, I don't recall this ever
happening, but upon reflection, it may just be that our IT team always
caught the situation before it became too bad, and either extended the
storage, or asked permission to delete snapshots.
> All in-flight writes to the volume that the array ACK'd are still good
> even if they haven't been de-staged to the intended device thanks to NVRAM
> and the array's journal device.
>
Right. A good feature. An outage occurs, but the data that was properly
written stays written.
<quote>
> In these cases, I
> would expect that LVM thin volumes should not be used across
> multiple customers without understanding the exact type of
> churn expected, to understand what the maximum allocation
> that would be required.
> </quote>
>
> sure, but that spells responsible sysadmin. Xen's post implied he didn't
> want to be bothered to manage his block layer, and that magically it was
> the FS' job to work closely with the block layer to suss out when it was
> safe to keep accepting writes. There's an answer to "works closely with block
> layer" - it's spelled BTRFS and ZFS.
>
I get a bit lost here in the push towards BTRFS and ZFS for people with
these expectations as I see BTRFS and ZFS as having a similar problem. They
can both still fill up. They just might get closer to 100% utilization
before they start to fail.
My use case isn't about reaching closer to 100% utilization. For example,
when I first proposed our LVM thinp model for dealing with host-side
snapshots, there were people in my team that felt that "fstrim" should be
run very frequently (even every 15 minutes!), so as to make maximum use of
the available free space across multiple volumes and reduce churn captured
in snapshots. I think anybody with this perspective really should be
looking at BTRFS or ZFS. Myself, I believe fstrim should run once a week or
less, and not really to save space, but more to hint to the flash device
which blocks are definitely not in use over time, to make the best use of
the flash storage over time. If we start to pass 80%, I raise the alarm
that we need to consider increasing the local storage, or moving more
content out of the thin volumes. Usually we find out that more-than-normal
churn occurred, and we just need to prune a few snapshots to drop below 50%
again. I still made them move the content that doesn't need to be snapshot
out of the thin volume, and to a stand-alone LVM thick volume so as to
entirely eliminate this churn from being trapped in snapshots and
accumulating.
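The weekly-not-frequent fstrim cadence described above is typically wired up as a cron drop-in or the stock systemd timer. This is an illustrative, untested fragment (path and scheduling are assumptions):

```shell
#!/bin/sh
# Illustrative /etc/cron.weekly/fstrim: once a week, hint to the flash
# device which blocks are unused (a wear/GC hint, not a space-saver).
# On systemd distros the shipped timer does the same job:
#     systemctl enable --now fstrim.timer
fstrim --all
```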
> LVM has no obligation to protect careless sysadmins doing dangerous things
> from themselves. There is nothing wrong with using THIN every which way you
> want just as long as you understand and handle the eventuality of extent
> exhaustion. Even thin snaps go invalid if it needs to track a change and
> can't allocate space for the 'copy'.
>
Right.
> Amazon would make sure to have enough storage to meet my requirement if I
> need them.
>
> Yes, because Amazon is a RESPONSIBLE sysadmin and has put in place tools
> to manage the fact they are thin-provisoning and to make damn sure they can
> cash the checks they are writing.
>
Right.
> > the nature of the block device, such as "how much space
> > do you *really* have left?"
>
> So you're going to write and then backport "second guess the block layer"
> code to all filesystems in common use and god knows how many versions back?
> Of course not. Just try to get on the EXT developer mailing list and ask
> them to write "block layer second-guessing code (aka branch on device
> flag=thin)" because THINP will cause problems for the FS when it runs out
> of extents. To which the obvious and correct response will be "Don't use
> THINP if you're not prepared to handle its prerequisites."
>
Bad things happen. Sometimes they happen very quickly. I don't intend to
dare fate, but if fate comes knocking, I prefer to be prepared. For
example, we had two monitoring systems in place for one particularly
critical piece of storage, where the application is particularly poor at
dealing with "out of space". No thin volumes in use here. Thick volumes all
the way. The system on the storage appliance stopped sending notifications
a few weeks prior as a result of some mistake during a reconfiguration or
upgrade. The separate monitoring system using entirely different software
and configuration, on a different host, also failed for a different reason
that I no longer recall. The volume became full, and the application data
was corrupted in a bad way that required recovery. My immediate reaction
after best addressing the corruption, was to demand three monitoring
systems instead of two. :-)
> > you and the other people. You think the block storage should
> > be as transparent as possible, as if the storage was not
> > thin. Others, including me, think that this theory is
> > impractical
> Then by all means go ahead and retrofit all known filesystems with the
> extra logic. ALL of the filesystems were written with the understanding
> that the block layer is telling the truth and that any "white lie" was
> benign in so much that it would be made good and thus could be assumed to
> be "truth" for practical purpose.
I think this relates more closely to your other response, that I will
respond to separately...
--
Mark Mielke <mark.mielke@gmail.com>
* Re: [linux-lvm] thin handling of available space
@ 2016-05-03 18:19 Xen
0 siblings, 0 replies; 29+ messages in thread
From: Xen @ 2016-05-03 18:19 UTC (permalink / raw)
To: LVM general discussion and development
Zdenek Kabelac wrote on 03-05-2016 17:45:
> It's not a 'continued' suggestion.
>
> It's just an example of a solution where 'filesystem & block layer'
> are tied together. Every solution has some advantages and
> disadvantages.
So what if more systems were tied together in that way? What would be
the result?
Tying together does not have to do away with layers.
It is not either/or, it is both/and.
You can have separate layers and you can have integration.
In practice all it would require is for the LVM, ext and XFS people to
agree.
You could develop extensions to the existing protocols that are only
used if both parties understand it.
Then pretty much btrfs has no raison d'être anymore. You would have an
integrated system but people can retain their own identities as much as
they want.
From what you say LVM+ext4/XFS is already a partner system anyway.
It is CLEAR LVM+BTRFS or LVM+ZFS is NOT a popular system.
You can and you could but it does not synergize. OpenSUSE uses btrfs by
default and I guess they use LVM just as well. For LVM you want a
simpler filesystem that does its own work.
(At the same time I am not so happy with the RAID capability of LVM, nor
do I care much at this point).
LVM raid seems to me the third solution after firmware raid, regular
dmraid and .... and that.
I prefer to use LVM on top of raid really. But maybe that's not very
helpful.
> So far I'm convinced layered design gives user more freedom - for the
> price
> of bigger space usage.
Well let's stop directing people to btrfs then.
Linux people have a tendency and habit to send people from pillar to
post.
You know what that means.
It means 50% of answers you get are redirects.
They think it's efficient to spend their time redirecting you or wasting
your time in other ways, rather than using the same time and energy
answering your question.
If the social Linux system was a filesystem, people would run benchmarks
and complain that its organisation is that of a lunatic.
Where 50% of read requests get directed to another sector, of which 50%
again get redirected, and all for no purpose really.
Write requests get 90% deflected. The average number of write requests
before you hit your target is about ... it converges exactly to 10.
If I had been better at math I would have known that :p.
You say:
"Please don't compare software to real life".
No, let's compare the social world to technology. We have very bad
technology if you look at it like that. Which in turn doesn't make the
"real" technology much better.
SUM( i * p * (1-p)^(i-1) ) for i = 1 to infinity = 1/p,
with p the chance of success at each attempt.
With 90% of write attempts deflected, the chance of success per attempt
is p = 0.1, so the average number of attempts before success is
1/0.1 = 10.
I'm not very brilliant today.
* Re: [linux-lvm] thin handling of available space
2016-05-03 13:01 ` matthew patton
@ 2016-05-03 15:47 ` Xen
2016-05-04 0:56 ` Mark Mielke
1 sibling, 0 replies; 29+ messages in thread
From: Xen @ 2016-05-03 15:47 UTC (permalink / raw)
To: LVM general discussion and development
matthew patton wrote on 03-05-2016 15:01:
> Ceph?!? yeah I don't think so.
Mark's argument was not about comparing feature sets or anything like
that at this point. So I don't know what you are responding to. You respond like
a bitten bee.
Read again. Mark Mielke described actual present-day positions. He
described what he thinks is how LVM is positioning itself in conjunction
with and with regards to other solutions in industry. He described that
to his mind the bigger remote storage solutions do not or would not
easily or readily start using LVM for those purposes, while the smaller
scale or more localized systems would.
He described a layering solution, that you seem to be allergic to. He
described a modularized system where thin is being used both at the
remote backend (using a different technology) and at the local end
(using LVM) for different purposes but achieving much of the same
results.
He described how he considered the availability of the remote pool a
responsibility for that remote supplier (and paying good money for it)
while having different use cases for LVM thin himself or themselves.
And I did think he made a very good case for this. I absolutely believe
his use case is the most dominant and important one for LVM. LVM is for
local systems.
In this case it is a local system running storage on a remote backend.
Yet the local system has different requirements and uses LVM thin for a
different purpose.
And this purpose falls along the lines of having cheap and freely
available snapshots.
And he still feels and believes, apparently, that using the LVM admin
tools for ensuring the stability of his systems might not be the most
attractive and functional thing to do.
You may not agree with that but it is what he believes and feels. It is
a real life data point, if you care about that.
Sometimes people's opinions actually simply just inform you of the
world. It is information. It is not something to fight or disagree with,
it is something to take note of.
The better you are able to respond to these data points, the better you
are aware of the system you are dealing with. That could be real people
paying or not paying you money.
However if you are going to fight every opinion that disagrees with you,
you will never get to the point of actually realizing that they are just
opinions and they are a wealth of information if you'd make use of it.
And that is not a devious thing to do if you're thinking that. It is
being aware. Nothing more, nothing less.
And we are talking about awareness here. Not surprising then that the
people most vehemently opposing this also seem to be the people least
aware of the fact that real people with real use cases might find the
current situation not practical.
Mr. Zdenek can say all he wants that the current situation is very
practical.
If that is not a data point but an opinion (not of someone experiencing
it, but someone who wants certain people to experience certain things)
then we must listen to actual data points and not what he wants.
Mr. Zdenek (I haven't responded to him here now) also responds like a
bitten bee to simple allusions that Red Hat might be thinking this or
that.
Not just stung by a bee. A bee getting stung ;-).
I mean come on people. You have nothing to lose. Either it is a good
idea or it isn't. If it gets support, maybe someone will implement it
and deliver proof of concept. But if you go about shooting it down the
moment it rears its ugly (or beautiful) head you also ensure that that
developer time is not going to be spend on it even if it were an asset
to you.
Someone discussing a need is not always someone who, in the end, will do
nothing himself.
You are trying to avoid work but in doing so you avoid work being done
for you as well.
It's give or take, it's plus plus.
Don't kill other people's ideas and maybe they start doing work for you
too.
Oh yeah. Sorry if I'm being judgmental or belligerent (or pedantic):
The great irony and tragedy of the Linux world is this:
Someone comes with a great idea that he/she believes in and wants to
work on.
They shoot it down.
Next they complain why there are so very few volunteers.
They can ban someone on a mailing list one instant and out loud wonder
how they can attract more interest to their system, the next.
Not unrelated.
> sure, but that spells responsible sysadmin. Xen's post implied he
> didn't want to be bothered to manage his block layer, and that magically
> it was the FS' job to work closely with the block layer to suss out when
> it was safe to keep accepting writes. There's an answer to "works
> closely with block layer" - it's spelled BTRFS and ZFS.
It is not my block layer. I'm not the fucking system admin.
I can only talk to the FS. Or that might very well be the case for my
purposes here.
It is pretty amazing that any attempt to separate responsibilities in
actuality is met with a rebuttal that insists one use a solution that
mingles everything.
In your ideal world then, everyone is forced to use BTRFS/ZFS because at
least these take the worries away from the software/application
designer.
And you ensure a beautiful world without LVM because it has no purpose.
As a software developer I cannot depend on your magical solution and
assertion that every admin out there is going to be this amazing person
that never makes a mistake.
> Responsible usage has nothing to do with single vs multiple customers.
> Though Xen broached the 'hosting' example and in the cut-rate hosting
> business over-provisioning is rampant. It's not a problem unless the
> syadmin drops the ball.
What if I want him to be able to drop the ball and still survive?
What about designing systems that are actually failsafe and resilient?
What about resilience?
What about goodness?
What about quality?
What about good stuff?
Why do you feed your admins bad stuff just so that they can shine and
consider themselves important?
> So you're going to write and then backport "second guess the block
> layer" code to all filesystems in common use and god knows how many
> versions back? Of course not. Just try to get on the EXT developer
> mailing list and ask them to write "block layer second-guessing code
> (aka branch on device flag=thin)" because THINP will cause problems
> for the FS when it runs out of extents. To which the obvious and
> correct response will be "Don't use THINP if you're not prepared to
> handle its prerequisites."
So you are basically suggesting a solution that you know will fail, but
still you recommend it.
That spells out "I don't know how to achieve my goals" like no other
thing.
But you still think people should follow your recommendations.
What you say is completely anathema to how the open source world works.
You do not ask people to do your work for you.
Why do you even insist on recommending that. And then when you (in your
imagination here) do ask those people to do it for you, they refuse. No
small wonder.
Still you consider that a good way to approach things. To depend on
someone else to do your work for you.
Really.
"Of course not. Just try to get on the EXT developer mailing list and
ask them to..."
Yes I am ridiculing you.
You were sincere in saying those words. You ridicule yourself.
Of course you would start designing patches and creating a workable
solution with yourself as the main leader or catalyst of that project.
There is no other way to do things in life. You should know that.
> Then by all means go ahead and retrofit all known filesystems with the
> extra logic. ALL of the filesystems were written with the
> understanding that the block layer is telling the truth and that any
> "white lie" was benign in so much that it would be made good and thus
> could be assumed to be "truth" for practical purpose.
Maybe we should also retrofit all unknown filesystems and those that
might be designed on different planets. Yeah, that would be a good way
to approach things.
I really want to follow your recommendations here. If I do, I will have
good chances of achieving success.
* Re: [linux-lvm] thin handling of available space
2016-05-03 13:15 ` Gionatan Danti
@ 2016-05-03 15:45 ` Zdenek Kabelac
0 siblings, 0 replies; 29+ messages in thread
From: Zdenek Kabelac @ 2016-05-03 15:45 UTC (permalink / raw)
To: LVM general discussion and development
On 3.5.2016 15:15, Gionatan Danti wrote:
> On 03/05/2016 13:42, Zdenek Kabelac wrote:
>>
>> What's wrong with 'lvs'?
>> This will give you the available space in thin-pool.
>>
>
> Oh, absolutely nothing wrong with lvs. I used "lsblk" only as an example of
> the block device/layer exposing some (lack of) features to upper layer.
>
> One note about the continued "suggestion" to use BTRFS. While for relatively
It's not a 'continued' suggestion.
It's just the example of solution where 'filesystem & block layer' are tied
together. Every solution has some advantages and disadvantages.
> simple use cases it can be ok, for more demanding (rewrite-heavy) scenarios
> (e.g. hypervisor, database, etc.) it performs *really* bad, even when "nocow"
> is enabled.
So far I'm convinced the layered design gives the user more freedom - for the
price of bigger space usage.
>
> Anyway, ThinLVM + XFS is an extremely good combo in my opinion.
>
Yes, though ext4 is quite good as well...
Zdenek
* Re: [linux-lvm] thin handling of available space
2016-05-03 12:00 ` matthew patton
@ 2016-05-03 14:38 ` Xen
2016-05-04 1:25 ` Mark Mielke
1 sibling, 0 replies; 29+ messages in thread
From: Xen @ 2016-05-03 14:38 UTC (permalink / raw)
To: LVM general discussion and development
Just want to respond to this just to make things clear.
matthew patton wrote on 03-05-2016 14:00:
> why all of a sudden does each and every FS have to have this added
> code to second guess the block layer? The quickest solution is to
> mount the FS in sync mode. Go ahead and pay the performance piper.
> It's still not likely to be bullet proof but it's a sure step closer.
Why would anyone do what you don't want to do yourself? Don't suggest
solutions you don't even want. That goes for all of you (Zdenek mostly).
And it is not second-guessing. What the FS does currently is the
second-guessing. If you have actual information from the block layer, you
don't NEED to second-guess.
Isn't that obvious?
> What you're saying is that when mounting a block device the layer
> needs to expose a "thin-mode" attribute (or the sysadmin sets such a
> flag via tune2fs). Something analogous to how mke2fs can "detect" LVM
> RAID-mode geometry (does that actually work reliably?).
Not necessarily. It could be transparent if these were actual available
features as part of a feature set. The features would individually be
able to be turned on and off, not necessarily calling it "thin".
> Then there has to be code in every FS block de-stage path:
> IF thin {
>     ask the block layer to allocate the block (write zeros to it? -
>     what about pre-existing data: is there a "fake write" BIO call that
>     does everything except actually write data to a block, but would
>     still trigger LVM thin's extent allocation logic?)
>     IF success, destage the dirty block to the block layer
>     ELSE inform userland of ENOSPC
> }
What Mark suggested is not actually so bad. Preallocating means you have
to communicate in some way to the user that space is going to run out.
My suggestion was, and in that sense still is, to simply do this by having
the filesystem update its reported amount of free space.
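Sketched in a few lines, this clamping idea looks as follows. It is purely illustrative - no such kernel interface exists today, and the function name and the idea of feeding the pool's unallocated space into the filesystem's free-space accounting are assumptions of the sketch:

```python
# Hypothetical sketch: the free space a thin-backed filesystem reports to
# userland is clamped by what the thin-pool can still actually back.
def effective_free_space(fs_free_bytes: int, pool_free_bytes: int) -> int:
    """Report no more free space than the pool can still deliver."""
    return min(fs_free_bytes, pool_free_bytes)
```

With this, a filesystem showing 60 units "free" on a pool with only 20 units of unallocated extents would report 20, so `df` stops lying.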
> This at least should maintain FS integrity albeit you may end up in a
> situation where the journal can never get properly de-staged, so
> you're stuck on any further writes and need to force RO.
I'm glad you think of solutions.
> IMO if the system admin made a conscious decision to use thin AND
> overprovision (thin by itself is not dangerous)
Again, that is just nonsense. There is not a person alive who wants to
use thin for something that is not overprovisioning, whether it be
snapshots or client sharing.
You are trying to get away with "hey, you chose it! now sucks if we
don't actually listen to you! hahaha."
SUCKER!!!!
No, the primary use case for thin is overprovisioning.
> , it's up to HIM to
> actively manage his block layer.
Block layer doesn't come into play with it.
You are separating "main admin task" and "local admin task".
What I mean is that there are different roles. Even if they are the same
person, they are different tasks.
Someone writing software has the task of ensuring his software keeps
working given failure conditions.
This software writer, even if it is the same person, cannot be expected
to at that point be thinking of LVM block allocation. These are
different things.
You communicate with the layers you communicate with. You don't go
around that.
When you write a system that is supposed to be portable, for instance,
you do not start depending on other features, tools or layers that are
out of reach the moment your system or software is deployed somewhere
else.
Filesystem communication is available to all applications. So any
application designed for a generic purpose of installment is going to be
wanting to depend on filesystem tools, not block layer tools.
You people apparently don't understand layering very well OR you would
never recommend bypassing an intermediate layer (the filesystem) to go
directly to the lower level (the block layer) for ITS admin tools.
I mean, are you insane? You (Zdenek mostly) are so adamant about not mixing
layers, but then it is alright to go around them?
A software tool that is meant to be redeployable should be able to depend
on a minimal set of existing features in the direct layer it interfaces
with, while still using whatever extras are available when circumstances
allow it without harming its redeployability. Such a tool would never
choose to acquire and use the more remote and more uncertain set of
measures (such as LVM) when it could use directly available ones (such as
free disk space, as a crude metric) that exist on ANY system, provided
that yes indeed, there is some level of sanity to it.
If you ARE deployed on thin and the filesystem cannot know about actual
space then you are left in the dark, you are left blind, and there is
nothing you can do as a systems programmer.
> Even on million dollar SANs the
> expectation is that the engineer will do his job and not drop the mic
> and walk away.
You constantly focus on the admin.
With all of this hotshot and idealist behaviour about layers you are
espousing, you actually advocate going around them completely and using
whatever deepest-layer or most-impact solution that is available (LVM)
in order to troubleshoot issues that should be handled by interfacing
with the actual layer you always have access to.
It is not just about admins. You make this about admins as if they are
solely responsible for the entire system.
> Maybe the "easiest" implementation would be a MD layer job that the
> admin can tailor to fail all allocation requests once
> extent count drops below a number and thus forcing all FS mounted on
> the thinpool to go into RO mode.
A real software engineer doesn't go for the easiest solution or
implementation. I am not approaching this exclusively from the perspective
of an admin. I am also, and more importantly, a software programmer who
wants to build systems that are going to work regardless of the
peculiarities of whatever implementation or system I have to work on,
and I don't leave it to the admin of said system to do all my tasks.
As a programmer I cannot assume that the admin is going to be the perfect
human being you would so like to believe in, because that's what you think
you are: that amazing admin who never fails to take account of available
disk space.
But that's a moron position.
If I am to write my software, I cannot depend on bigger-scale or
outer-level solutions to always be in place. I cannot offload my
responsibilities to the admin.
You are insisting here that layers (administration layers and tasks) are
mixed and completely destroyed, all in the sense of not doing that to
the software itself?
Really?
Most importantly if I write any system that cannot depend on LVM being
present, then NO THOSE TOOLS ARE NOT AVAILABLE TO ME.
"Why don't you just use LVM?" well fuck off.
I am not that admin. I write his system. I don't do his work.
Yet I still have the responsibility that MY component is going to work
and not give HIM headaches. That's real life for you.
Even if in actuality I might be imprisoned with broken feet and arms, I
still care about this and I still do this work in a certain sense.
And yes, I utterly care about modularity in software design. I understand
layers much better than you do if you are capable of suggesting such
solutions.
Communication between layers does not necessarily integrate the layers
if those interfaces are well defined and allow for modular swapping of
the chosen solution.
I recognise full well that there is integration and that you do get things
working together. But that is the entire purpose of it: to get the two
things to work together more. That is the whole gist of having
interfaces and APIs in the first place.
It is for allowing stuff to work together to achieve a higher goal than
they could achieve if they were just on their own.
While recognising where each responsibility lies.
BLOCK LAYER <----> BLOCK LAYER ADMIN
FILESYSTEM LAYER <----> FILESYSTEM LAYER ADMIN
APPLICATION LAYER <----> APPLICATION WRITER
Me, the application writer, cannot be expected to deal with number one,
the block layer.
At the same time I need tools to do my work. I also cannot go to whatever
random block layer admin my system might get deployed under (who's to say
I will be there?) and beg him to spend an ample amount of time designing
his systems from scratch so that even if my software fails, it won't
hurt anyone.
But without information on available space I might not be able to do
anything.
Then what happens is that I have to design for this uncertainty.
Then what happens is that I (with capital IIIII) start allocating space
in advance as a software developer making applications for systems that
might I don't know, run on banks or whatever. Just saying something.
Yes now this task is left to the software designer making the
application.
Now I have to start allocating buffers to ensure graceful shutdown or
termination, for instance.
I might for instance allocate a block file, and if writes to the
filesystem start to fail or the filesystem becomes read-only, I might
still be in trouble not being able to write to it ;-). So I might start
thinking about kernel modules that I can redeploy with my system that
ensure graceful shutdown or even continued operation. I might decide
that files mounted as loopback are going to stay writable even if the
filesystem they reside on is now readonly. I am going to ensure these
are not sparse blocks and that the entire file is written to and grown
in advance, so that my writes start to look like real block device
writes. Then I'm just going to patch the filesystem or the VFS to
allow writes to these files even if it comes with the performance hit of
additional checks.
And that hopefully not the entire volume gets frozen by LVM.
But that the kernel or security scripts just remount it ro.
That is then the best solution for my needs in that circumstance.
Just saying you know.
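The "entire file is written to and grown in advance" step above can be sketched directly. Writing real zeros (rather than truncating, which leaves holes) is what forces the filesystem, and a thin pool beneath it, to allocate the blocks now instead of at the worst possible moment. A minimal sketch, with the function name being my own invention:

```python
import os

def preallocate(path: str, size: int, chunk: int = 1 << 20) -> None:
    """Write real zeros (not a hole) so every block is actually backed."""
    zeros = b"\0" * chunk
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # make sure the allocation really happened
```

Note that `fallocate`-style preallocation is not enough here: it reserves blocks in the filesystem, but a thin pool underneath typically only allocates extents when real data is written.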
It's not all exclusively about admins working with LVM directly.
> But in any event it won't prevent irate users from demanding why the
> space they appear to have isn't actually there.
If that is your life I feel sorry for you.
I just do.
* Re: [linux-lvm] thin handling of available space
2016-05-03 11:42 ` Zdenek Kabelac
@ 2016-05-03 13:15 ` Gionatan Danti
2016-05-03 15:45 ` Zdenek Kabelac
0 siblings, 1 reply; 29+ messages in thread
From: Gionatan Danti @ 2016-05-03 13:15 UTC (permalink / raw)
To: LVM general discussion and development
On 03/05/2016 13:42, Zdenek Kabelac wrote:
>
> Danger with having 'disable' options like this is many distros do decide
> themselves about best defaults for their users, but Ubuntu with their
> issue_discards=1 shown us to be more careful as then it's not Ubuntu but
> lvm2 which is blamed for dataloss.
>
> Options are evaluated...
>
Very true. "Sane defaults" is one of the reasons why I (happily) use
RHEL/CentOS for hypervisors and other critical tasks.
>
>
> What's wrong with 'lvs'?
> This will give you the available space in thin-pool.
>
Oh, absolutely nothing wrong with lvs. I used "lsblk" only as an example
of the block device/layer exposing some (lack of) features to upper layer.
One note about the continued "suggestion" to use BTRFS. While for
relatively simple use cases it can be ok, for more demanding
(rewrite-heavy) scenarios (e.g. hypervisor, database, etc.) it performs
*really* bad, even when "nocow" is enabled.
I had much more fortune, performance wise, with ZFS. Too bad ZoL is an
out-of-tree component (albeit very easy to install and, in my
experience, quite stable also).
Anyway, ThinLVM + XFS is an extremely good combo in my opinion.
>
> _______________________________________________
> linux-lvm mailing list
> linux-lvm@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-lvm
> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: [linux-lvm] thin handling of available space
[not found] <1614984310.1700582.1462280490763.JavaMail.yahoo.ref@mail.yahoo.com>
@ 2016-05-03 13:01 ` matthew patton
2016-05-03 15:47 ` Xen
2016-05-04 0:56 ` Mark Mielke
0 siblings, 2 replies; 29+ messages in thread
From: matthew patton @ 2016-05-03 13:01 UTC (permalink / raw)
To: LVM general discussion and development
On Mon, 5/2/16, Mark Mielke <mark.mielke@gmail.com> wrote:
<quote>
very small use case in reality. I think large service
providers would use Ceph or EMC or NetApp, or some such
technology to provision large amounts of storage per
customer, and LVM would be used more at the level of a
single customer, or a single machine.
</quote>
Ceph?!? yeah I don't think so.
If you thin-provision an EMC/Netapp volume and the block device runs out of blocks (aka Raid Group is full) all volumes on it will drop OFFLINE. They don't even go RO. Poof, they disappear. Why? Because there is no guarantee that every NFS client, every iSCSI client, every FC client is going to do the right thing. The only reliable means of telling everyone "shit just broke" is for the asset to disappear.
All in-flight writes to the volume that the array ACK'd are still good even if they haven't been de-staged to the intended device thanks to NVRAM and the array's journal device.
<quote>
In these cases, I
would expect that LVM thin volumes should not be used across
multiple customers without understanding the exact type of
churn expected, to understand what the maximum allocation
that would be required.
</quote>
sure, but that spells responsible sysadmin. Xen's post implied he didn't want to be bothered to manage his block layer, and that magically it was the FS's job to work closely with the block layer to suss out when it was safe to keep accepting writes. There's an answer to "works closely with block layer" - it's spelled BTRFS and ZFS.
LVM has no obligation to protect careless sysadmins doing dangerous things from themselves. There is nothing wrong with using THIN every which way you want just as long as you understand and handle the eventuality of extent exhaustion. Even thin snaps go invalid if it needs to track a change and can't allocate space for the 'copy'.
Responsible usage has nothing to do with single vs multiple customers. Though Xen broached the 'hosting' example, and in the cut-rate hosting business over-provisioning is rampant. It's not a problem unless the sysadmin drops the ball.
> Amazon would make sure to have enough storage to meet my requirement if I need them.
Yes, because Amazon is a RESPONSIBLE sysadmin and has put in place tools to manage the fact that they are thin-provisioning and to make damn sure they can cash the checks they are writing.
> the nature of the block device, such as "how much space
> do you *really* have left?"
So you're going to write and then backport "second guess the block layer" code to all filesystems in common use and god knows how many versions back? Of course not. Just try to get on the EXT developer mailing list and ask them to write "block layer second-guessing code (aka branch on device flag=thin)" because THINP will cause problems for the FS when it runs out of extents. To which the obvious and correct response will be "Don't use THINP if you're not prepared to handle its prerequisites."
> you and the other people. You think the block storage should
> be as transparent as possible, as if the storage was not
> thin. Others, including me, think that this theory is
> impractical
Then by all means go ahead and retrofit all known filesystems with the extra logic. ALL of the filesystems were written with the understanding that the block layer is telling the truth and that any "white lie" was benign in so much that it would be made good and thus could be assumed to be "truth" for practical purpose.
* Re: [linux-lvm] thin handling of available space
2016-04-29 11:23 ` Zdenek Kabelac
2016-05-02 14:32 ` Mark Mielke
@ 2016-05-03 12:42 ` Xen
1 sibling, 0 replies; 29+ messages in thread
From: Xen @ 2016-05-03 12:42 UTC (permalink / raw)
To: LVM general discussion and development
Zdenek Kabelac wrote on 29-04-2016 13:23:
> I'm not going to add much to this thread - since there is nothing
> really useful for devel. But let me strike out few important moments:
If you like to keep things short now I will give short replies. Also
other people have responded and I haven't read everything yet.
> it's still the admin who creates thin-volume and gets WARNING if VG is
> not big enough when
> all thin volumes would be fully provisioned.
That is just what we could call insincere or that beautiful strange word
that I cannot remember.
The opposite of innocuous. Disingenuous (thank you dictionary).
You know perfectly well that this warning doesn't do much of anything
when all people approach thin from the view point of wanting to
overprovision.
That is like saying "Don't enter this pet store here, because you might
buy pets, and pets might scratch your arm. Now what can we serve you
with?".
It's those insincere warnings many business or ideas give to people to
supposedly warn them in advance of what they want to do anyway. "I told
you it was a bad idea, now what can we do for you? :) :) :) :)". It's a
way of being politically correct mostly.
You want to do it anyway. But now someone tells you it might be a bad
idea even if both of you want it.
> So you try to design 'another btrfs' on top of thin provisioning?
Maybe I am. At least you recognise that I am trying to design something,
many people would just throw it in the wastebasket with "empty
complains".
That in itself.... ;-)
speaks some volumes.
But let's talk about real volumes now :p.
There's nothing bad about btrfs except that it usurps everything, doesn't
separate any layers, and overall means the end and death of a healthy
filesystem ecosystem. It wants to be a monopoly.
> With 'thinp' you want the simplest filesystem with robust metadata - so
> in theory - 'ext4' or XFS without all the 'improvements for rotational
> hdd' that have accumulated over decades of their evolution.
I agree. I don't even use ext4, I use ext3. I feel ext4 may have some
benefits but they are not really worth anything.
> You miss the 'key' details.
>
> Thin pool is not constructing 'free-maps' for each LV all the time -
> that's why tools like 'thin_ls' are meant to be used from the
> user-space.
> It IS very EXPENSIVE operation.
>
> So before you start to present your visions here, please spend some
> time with reading doc and understanding all the technology behind it.
Sure I could do that. I could also allow myself to die without ever
having contributed to anything.
>> Even with a perfect LVM monitoring tool, I would experience a
>> consistent
>> lack of feedback.
>
> Mistake of your expectations
It has nothing to do with expectations. Things and feelings that keep
creeping up on you and keep annoying you have nothing to do with
expectations.
That is like saying that being thoroughly annoyed about something for
years while expecting it to go away by itself is the epitome of sanity.
For example: monitor makes buzzing noise when turned off. Deeply
frustrating, annoying, downright bad. Gives me nightmares even. You say
"You have bad expectations of hardware, hardware just does that thing,
you have to live with it." I go to shop, shop says "Yeah all hardware
does that (so we don't need to pay you anything back)".
That has nothing to do with bad expectations.
> If you are trying to operate thin-pool near 100% fullness - you will
> need to write and design completely different piece of software -
> sorry thinp
> is not for you and never will...
I am not trying to operate near 100% fullness.
Although it wouldn't be bad if I could manage that.
That would not be such a bad thing at all, if the tools and the mechanisms
were there to actually do it. Wouldn't you agree? Regardless of
what is possible or even what is to be considered "wise" here, wouldn't
it be beneficial in some way?
>
>>
>> Just a simple example: I can adjust "df" to do different stuff. But
>> any
>> program reporting free diskspace is going to "lie" to me in that
>> sense. So
>> yes I've chosen to use thin LVM because it is the best solution for me
>> right now.
>
> 'df' has nothing in common with 'block' layer.
A clothing retailer has nothing in common with a clothing manufacturer
either, but they are just both in the same business.
> But if you never planned to buy 10TB - you should never have allowed
> the creation of such a big volume in the first place!
So you are like saying the only use case of thin is a growth scenario
that can be met.
> So don't do it - and don't plan to use it - it's really that simple.
What I was saying was that it would be possible to maintain the contract
that any individual volume at any one time would be able to grow to max
size as long as other volumes don't start acting aberrantly. If you manage
all those volumes of course you would be able to choose this.
The purpose of the thin system is to maintain the situation that all
volumes can reach their full potential without (auto)extending, in that
sense.
If you did actually make a 1TB volume for a single client with a 10TB
V-size, you would be a very bad contractor. Who says it is not going to
happen overnight? How would you be able to respond?
The situation where you have a 10TB volume and you have 20 clients with
1TB each, is very different.
I feel the contract should be that the available real space should
always be equal to or greater than the available on any one filesystem
(volume).
So: R >= max(A(1), A(2), A(3), ..., A(n))
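That contract is trivial to write down as a check. A sketch (the names are illustrative, not any existing tool):

```python
# R >= max(A(1), ..., A(n)): the pool's real free space must cover the
# free space reported by every single filesystem (volume) on it.
def contract_holds(pool_free: int, fs_free_list: list) -> bool:
    """True if no single filesystem promises more than the pool has."""
    return pool_free >= max(fs_free_list)
```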
Of course it is pleasant not having to resize the filesystem but would
you really do that for yourself? Make a 10TB filesystem on a 1TB disk as
you expect to buy more disks in the future?
I mean you could. But in this sense resizing the filesystem (growing it)
is not a very expensive operation, usually.
I would only want to do that if I could limit the actual usage of the
filesystem in a real way.
Any runaway process causing my volume to drop...... NOT a good thing.
> Actually it's the core principle!
> It lies (or better said, uses the admin's promises) that there is going
> to be disk space. And it's the admin's responsibility to fulfill it.
The admin never comes into it. What the admin does or doesn't do, what
the admin thinks or doesn't think. These are all interpretations of
intents.
Thinp should function regardless of what the admin is thinking or not.
Regardless of what his political views are.
You are bringing morality into the technical system.
You are saying /thinp should work/ because /the admin should be a good
person/.
When the admin creates the system, no "promise" is ever communicated to
the hardware layer, OR the software layer. You are turning the correct
operation of the machine into a human problem in the way of saying
"/Linux is a great system and everyone can use it, but some people are
just too stupid to spend a few hours reading a manual on a daily basis,
and we can't help that/".
These promises are not there in the system. Someone might be using the
system for reasons you have not envisioned. But the system is there and
it can be used for it. Now if things go wrong you say "You you had the
wrong use case" but a use case is just a use case, it has no morality to
it.
If you build a waterway system that only functions as long as it doesn't
rain (overflowing the channels) then you can say "Well my system is
perfect, it is just God who is a bitch and messes things up".
No you have to take account of real life human beings, not those ideal
pictures of admins that you have.
Stop the idealism you know. Admins are humans and they can be expected
to be humans.
It is you who have wrong expectations of people.
If people mess up they mess up but it is part of the human agenda and
you design for that.
> If you know up front that you will quickly need all the disk space - then
> using thinp and expecting a miracle is not going to work.
Nobody ever said anything of that kind.
* Re: [linux-lvm] thin handling of available space
[not found] <1870050920.5354287.1462276845385.JavaMail.yahoo.ref@mail.yahoo.com>
@ 2016-05-03 12:00 ` matthew patton
2016-05-03 14:38 ` Xen
2016-05-04 1:25 ` Mark Mielke
0 siblings, 2 replies; 29+ messages in thread
From: matthew patton @ 2016-05-03 12:00 UTC (permalink / raw)
To: LVM general discussion and development
> written as required. If the file system has particular areas
> of importance that need to be writable to prevent file
> system failure, perhaps the file system should have a way of
> communicating this to the volume layer. The naive approach
> here might be to preallocate these critical blocks before
> proceeding with any updates to these blocks, such that the
> failure situations can all be "safe" situations,
> where ENOSPC can be returned without a danger of the file
> system locking up or going read-only.
why all of a sudden does each and every FS have to have this added code to second guess the block layer? The quickest solution is to mount the FS in sync mode. Go ahead and pay the performance piper. It's still not likely to be bullet proof but it's a sure step closer.
What you're saying is that when mounting a block device the layer needs to expose a "thin-mode" attribute (or the sysadmin sets such a flag via tune2fs). Something analogous to how mke2fs can "detect" LVM RAID-mode geometry (does that actually work reliably?).
Then there has to be code in every FS block de-stage path:
IF thin {
    ask the block layer to allocate the block (write zeros to it? - what
    about pre-existing data: is there a "fake write" BIO call that does
    everything except actually write data to a block, but would still
    trigger LVM thin's extent allocation logic?)
    IF success, destage the dirty block to the block layer
    ELSE inform userland of ENOSPC
}
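For illustration only, the pseudocode above rendered as runnable Python - `ThinBlockLayer`, `NoSpace`, and the extent bookkeeping are all made up for the sketch; nothing here is a real kernel or LVM interface:

```python
import errno

class NoSpace(Exception):
    """Raised when the (pretend) thin pool has no free extents left."""
    pass

class ThinBlockLayer:
    def __init__(self, free_extents: int):
        self.free_extents = free_extents
        self.allocated = set()

    def allocate(self, block: int) -> None:
        if block in self.allocated:
            return                      # already backed by a real extent
        if self.free_extents == 0:
            raise NoSpace()
        self.free_extents -= 1
        self.allocated.add(block)

def destage(layer: ThinBlockLayer, block: int, data: bytes,
            is_thin: bool = True) -> int:
    """De-stage one dirty block; returns 0 or -ENOSPC for userland."""
    if is_thin:
        try:
            layer.allocate(block)       # the "tickle" step
        except NoSpace:
            return -errno.ENOSPC        # tell userland instead of corrupting
    # ... actually write `data` to `block` here ...
    return 0
```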
In a fully journal'd FS (metadata AND data) the journal could be 'pinned', and likewise the main metadata areas, if for no other reason than that they are zeroed at creation and/or constantly being written to. Once written to, LVM thin isn't going to go back and yank away an allocated extent.
This at least should maintain FS integrity albeit you may end up in a situation where the journal can never get properly de-staged, so you're stuck on any further writes and need to force RO.
> just want a sanely behaving LVM + XFS...)
IMO if the system admin made a conscious decision to use thin AND overprovision (thin by itself is not dangerous), it's up to HIM to actively manage his block layer. Even on million dollar SANs the expectation is that the engineer will do his job and not drop the mic and walk away. Maybe the "easiest" implementation would be a MD layer job that the admin can tailor to fail all allocation requests once extent count drops below a number and thus forcing all FS mounted on the thinpool to go into RO mode.
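One possible shape for such a job, sketched in Python: parse thin-pool fullness (as printed by something like `lvs --noheadings -o lv_name,data_percent`) and report the pools over a threshold; the remount action is left as a comment. The output format, field names, and threshold are assumptions of the sketch, not a supported LVM tool:

```python
# Illustrative monitoring sketch for a cron/dmeventd-style job.
def pools_over_threshold(lvs_output: str, threshold: float = 95.0):
    """Return names of thin-pools whose data_percent >= threshold."""
    full = []
    for line in lvs_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        name, pct = fields[0], fields[1].replace(',', '.')
        if float(pct) >= threshold:
            full.append(name)
    return full

# The job could then, for each pool returned, do something like:
#   mount -o remount,ro <every filesystem living on that pool>
```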
But in any event it won't prevent irate users from demanding why the space they appear to have isn't actually there.
* Re: [linux-lvm] thin handling of available space
2016-05-03 10:15 ` Gionatan Danti
@ 2016-05-03 11:42 ` Zdenek Kabelac
2016-05-03 13:15 ` Gionatan Danti
0 siblings, 1 reply; 29+ messages in thread
From: Zdenek Kabelac @ 2016-05-03 11:42 UTC (permalink / raw)
To: LVM general discussion and development
On 3.5.2016 12:15, Gionatan Danti wrote:
>
> On 02/05/2016 16:32, Mark Mielke wrote:
>> The WARNING is a cover-your-ass type warning that is showing up
>> inappropriately for us. It is warning me something that I should already
>> know, and it is training me to ignore warnings. Thinp doesn't have to be
>> the answer to everything. It does, however, need to provide a block
>> device visible to the file system layer, and it isn't invalid for the
>> file system layer to be able to query about the nature of the block
>> device, such as "how much space do you *really* have left?"
>
> As this warning appears on snapshots, it is quite annoying in fact. On the
> other hand, I fully understand that the developers want to avoid "blind"
> overprovisioning. A command-line (or an lvm.conf) option to override the
> warning would be welcomed, though.
Since the number of reports from people who used thin-pools without realizing
what they could do wrong was too high, a rather 'dramatic' WARNING approach
is used. The advised usage is with dmeventd & monitoring.
The danger with 'disable' options like this is that many distros decide
themselves about the best defaults for their users; Ubuntu with their
issue_discards=1 showed us we must be more careful, as then it's not Ubuntu
but lvm2 which gets blamed for data loss.
Options are evaluated...
>
>> This seems to be a crux of this debate between you and the other people.
>> You think the block storage should be as transparent as possible, as if
>> the storage was not thin. Others, including me, think that this theory
>> is impractical, as it leads to edge cases where the file system could
>> choose to fail in a cleaner way, but it gets too far today leading to a
>> more dangerous failure when it allocates some block, but not some other
>> block.
>>
>> ...
>>
>>
>> It is your opinion that extending thin volumes to allow the file system
>> to have more information is breaking some fundamental law. But, in
>> practice, this sort of thing is done all of the time. "Size", "Read
>> only", "Discard/Trim Support", "Physical vs Logical Sector Size", ...
>> are all information queried from the device, and used by the file
>> system. If it is a general concept that applies to many different device
>> targets, and it will help the file system make better and smarter
>> choices, why *shouldn't* it be communicated? Who decides which ones are
>> valid and which ones are not?
>
> This seems reasonable. After all, a simple "lsblk" already reports plenty of
> information to the upper layer, so adding a "REAL_AVAILABLE_SPACE" info should
> not be infeasible.
What's wrong with 'lvs'?
This will give you the available space in thin-pool.
However, combining this number with the amount of free space in the
filesystem - that needs magic.
When you create a file with a hole in your filesystem - how much free space
do you have?
If you have 2 filesystems in a single thin-pool - does each take 1/2?
It's all about lying....
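The "file with a hole" question is easy to demonstrate: a file's apparent size and the blocks actually allocated for it are two different numbers, which is exactly the accounting problem a combined fs+pool free-space figure would face. A small sketch (standard Python; sparse-file behaviour is assumed from the underlying filesystem):

```python
import os
import tempfile

# Create a 1 MiB file that is all hole: big apparent size, no data written.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.truncate(1024 * 1024)
    path = f.name

st = os.stat(path)
apparent = st.st_size          # 1 MiB as far as `ls -l` is concerned
on_disk = st.st_blocks * 512   # usually ~0 on a sparse-capable filesystem
os.unlink(path)
```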
Regards
Zdenek
* Re: [linux-lvm] thin handling of available space
2016-05-03 10:41 ` Mark Mielke
@ 2016-05-03 11:18 ` Zdenek Kabelac
0 siblings, 0 replies; 29+ messages in thread
From: Zdenek Kabelac @ 2016-05-03 11:18 UTC (permalink / raw)
To: Mark Mielke; +Cc: LVM general discussion and development
On 3.5.2016 12:41, Mark Mielke wrote:
> On Tue, May 3, 2016 at 5:45 AM, Zdenek Kabelac <zkabelac@redhat.com
> <mailto:zkabelac@redhat.com>> wrote:
>
> On 2.5.2016 16:32, Mark Mielke wrote:
>
> If you seek for a filesystem with over-provisioning - look at
> btrfs, zfs
> and other variants...
>
> I have to say that I am disappointed with this view, particularly if
> this is a
> view held by Red Hat. To me this represents a misunderstanding of the
> purpose
>
>
> So first - this is AMAZING deduction you've just shown.
>
> You've cut sentence out of the middle of a thread and used as kind of evidence
> that Red Hat is suggesting usage of ZFS, Btrfs - sorry man - read this
> thread again...
>
>
> My intent wasn't to cut a sentence in the middle. I responded to each
> sentence in its place. I think it really comes down to this:
>
> This seems to be a crux of this debate between you and the other
> people. You
> think the block storage should be as transparent as possible, as if the
> storage was not thin. Others, including me, think that this theory is
> impractical, as it leads to edge cases where the file system could
> choose to
>
>
> It's purely practical and it's the 'crucial' difference between
>
> i.e. thin+XFS/ext4 and BTRFS.
>
>
>
> I think I captured the crux of this pretty well. If anybody suggests that
> there could be value to exposing any information related to the nature of the
> "thinly provisioned block devices", you suggest that the only route forwards
> here is BTRFS and ZFS. You are saying directly and indirectly, that anybody
> who disagrees with you should switch to what you feel are the only solutions
> that are in this space, and that LVM should never be in this space.
>
> I think I understand your perspective. However, I don't agree with it. I don't
The perspective of the lvm2 team is pretty simple: as a small team, there is
absolutely no time to venture down this road.
Also, technically, you are barking up the wrong tree.
Try to pitch your visions to some filesystem developers.
> agree that the best solution is one that fails at the last instant with ENOSPC
> and/or for the file system to become read-only. I think there is a whole lot
> of grey possibilities between the polar extremes of "BTRFS/ZFS" vs
> "thin+XFS/ext4 with last instant failure".
The other point is that the technical difficulties are very high, and you are
really asking for Btrfs logic - you just fail to admit this to yourself.
Combining volume management and the filesystem for a better future has been
the 'core' idea of Btrfs from the start...
> What started me on this list was the CYA mandatory warning about over
> provisioning that I think is inappropriate, and causing us tooling problems.
> But seeing the debate unfold, and having seen some related failures in the
> Docker LVM thin pool case where the system may completely lock up, I have a
> conclusion that this type of failure represents a fundamental difference in
> opinion around what thin volumes are for, and what place they have. As I see
> them as highly valuable for various reasons including Docker image layers
> (something Red Hat appears to agree with, having targeted LVM thinp instead of
As you mention Docker - again, I've no idea why you think there is a 'one-way'
path.
Red Hat is not a political party with a single leading direction.
Many variants are being implemented in parallel (yes, even within Red Hat) and
the best one will win over time - but there is no single 'directive' decision.
It really is the open source way.
> the union file systems), and the snapshot use cases I presented prior, I think
> there must be a way to avoid the worst scenarios, if the right people consider
> all the options, and don't write off options prematurely due to preconceived
> notions about what is and what is not appropriate in terms of communication of
> information between system layers.
>
> There are many types of information that *are* passed from the block device
> layer to the file system layer. I don't see why awareness of thin volumes,
> should not be one of them.
>
Find a use-case, build a patch, show results, and specify what the filesystem
shall do when its underlying device changes its characteristics.
There is an API between the block layer and the fs layer - so propose an
extension, with a patch for a filesystem, with a clearly defined benefit.
That's my best advice.
> communicating this to the volume layer. The naive approach here might be to
> preallocate these critical blocks before proceeding with any updates to these
> blocks, such that the failure situations can all be "safe" situations, where
> ENOSPC can be returned without a danger of the file system locking up or going
> read-only.
>
> Or, maybe I am out of my depth, and this is crazy talk... :-)
Basically, you are not realizing how much work is behind all those simple
sentences. At this moment 'fallocate' support is being discussed...
But it's more or less a 'nuclear weapon' for thin provisioning.
>
> (Personally, I'm not really needing a "df" to approximate available storage...
> I just don't want the system to fail badly in the "out of disk space"
> scenario... I can't speak for others, though... I do *not* want BTRFS/ZFS... I
> just want a sanely behaving LVM + XFS...)
Yes - that's what we try to improve daily.
Regards
Zdenek
* Re: [linux-lvm] thin handling of available space
2016-05-03 9:45 ` Zdenek Kabelac
@ 2016-05-03 10:41 ` Mark Mielke
2016-05-03 11:18 ` Zdenek Kabelac
0 siblings, 1 reply; 29+ messages in thread
From: Mark Mielke @ 2016-05-03 10:41 UTC (permalink / raw)
To: Zdenek Kabelac; +Cc: LVM general discussion and development
On Tue, May 3, 2016 at 5:45 AM, Zdenek Kabelac <zkabelac@redhat.com> wrote:
> On 2.5.2016 16:32, Mark Mielke wrote:
>>
>> If you seek for a filesystem with over-provisioning - look at btrfs,
>> zfs
>> and other variants...
>>
>> I have to say that I am disappointed with this view, particularly if this
>> is a
>> view held by Red Hat. To me this represents a misunderstanding of the
>> purpose
>>
>
> So first - this is AMAZING deduction you've just shown.
>
> You've cut sentence out of the middle of a thread and used as kind of
> evidence
> that Red Hat is suggesting usage of ZFS, Btrfs - sorry man - read this
> thread again...
>
My intent wasn't to cut a sentence in the middle. I responded to each
sentence in its place. I think it really comes down to this:
>> This seems to be a crux of this debate between you and the other people. You
>> think the block storage should be as transparent as possible, as if the
>> storage was not thin. Others, including me, think that this theory is
>> impractical, as it leads to edge cases where the file system could choose
>> to
>>
>
> It's purely practical and it's the 'crucial' difference between
>
> i.e. thin+XFS/ext4 and BTRFS.
>
I think I captured the crux of this pretty well. If anybody suggests that
there could be value to exposing any information related to the nature of
the "thinly provisioned block devices", you suggest that the only route
forwards here is BTRFS and ZFS. You are saying directly and indirectly,
that anybody who disagrees with you should switch to what you feel are the
only solutions that are in this space, and that LVM should never be in this
space.
I think I understand your perspective. However, I don't agree with it. I
don't agree that the best solution is one that fails at the last instant
with ENOSPC and/or for the file system to become read-only. I think there
is a whole lot of grey possibilities between the polar extremes of
"BTRFS/ZFS" vs "thin+XFS/ext4 with last instant failure".
What started me on this list was the CYA mandatory warning about over
provisioning that I think is inappropriate, and causing us tooling
problems. But seeing the debate unfold, and having seen some related
failures in the Docker LVM thin pool case where the system may completely
lock up, I have a conclusion that this type of failure represents a
fundamental difference in opinion around what thin volumes are for, and
what place they have. As I see them as highly valuable for various reasons
including Docker image layers (something Red Hat appears to agree with,
having targeted LVM thinp instead of the union file systems), and the
snapshot use cases I presented prior, I think there must be a way to avoid
the worst scenarios, if the right people consider all the options, and
don't write off options prematurely due to preconceived notions about what
is and what is not appropriate in terms of communication of information
between system layers.
There are many types of information that *are* passed from the block device
layer to the file system layer. I don't see why awareness of thin volumes
should not be one of them.
For example, and I'm not pretending this is the best idea that should be
implemented, but just to see where the discussion might lead:
The Linux kernel needs to deal with problems such as memory being swapped
out due to memory pressures. In various cases, it is dangerous to swap
memory out. The memory can be protected from being swapped out where
required using various techniques such as pinning pages. This takes up extra
RAM, but ensures that the memory can be safely accessed and written as
required. If the file system has particular areas of importance that need
to be writable to prevent file system failure, perhaps the file system
should have a way of communicating this to the volume layer. The naive
approach here might be to preallocate these critical blocks before
proceeding with any updates to these blocks, such that the failure
situations can all be "safe" situations, where ENOSPC can be returned
without a danger of the file system locking up or going read-only.
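Mark's naive preallocation idea can be sketched with POSIX fallocate, which asks the filesystem (and, if it honours the request, the thin layer beneath it) to reserve blocks up front, converting a surprise failure mid-update into a clean ENOSPC at reservation time. A hypothetical sketch on Linux; the "critical region" size is invented for illustration:

```python
import errno
import os
import tempfile

CRITICAL_BYTES = 4 * 1024 * 1024   # hypothetical size of a "must stay writable" region

fd, path = tempfile.mkstemp()
try:
    # Reserve the blocks now: if the store cannot back them, we get a
    # clean ENOSPC here, *before* any critical update begins.
    os.posix_fallocate(fd, 0, CRITICAL_BYTES)
    reserved = True
except OSError as e:
    if e.errno != errno.ENOSPC:
        raise
    reserved = False               # refuse to start the risky update
finally:
    os.close(fd)
    os.remove(path)

print("reserved:", reserved)
```

Note the limitation the thread itself raises: a reservation made through the filesystem does not automatically pin chunks in a shared thin-pool, which is why fallocate-style support was called a "nuclear weapon" for thin provisioning.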
Or, maybe I am out of my depth, and this is crazy talk... :-)
(Personally, I'm not really needing a "df" to approximate available
storage... I just don't want the system to fail badly in the "out of disk
space" scenario... I can't speak for others, though... I do *not* want
BTRFS/ZFS... I just want a sanely behaving LVM + XFS...)
--
Mark Mielke <mark.mielke@gmail.com>
* Re: [linux-lvm] thin handling of available space
2016-05-02 14:32 ` Mark Mielke
2016-05-03 9:45 ` Zdenek Kabelac
@ 2016-05-03 10:15 ` Gionatan Danti
2016-05-03 11:42 ` Zdenek Kabelac
1 sibling, 1 reply; 29+ messages in thread
From: Gionatan Danti @ 2016-05-03 10:15 UTC (permalink / raw)
To: LVM general discussion and development
On 02/05/2016 16:32, Mark Mielke wrote:
>
> 2) Frequent snapshots. In many of our use cases, we may take snapshots
> every 15 minutes, every hour, and every day, keeping 3 or more of each.
> If this storage had to be allocated in full, this amounts to at least
> 10X the storage cost. Using snapshots, and understanding the rate of
> churn, we can use closer to 1X or 2X the storage overhead, instead of
> 10X the storage overhead.
>
> 3) Snapshot as a means of achieving a consistent backup at low cost of
> outage or storage overhead. If we "quiesce" the application (flush
> buffers, put new requests on hold, etc.) take the snapshot, and then
> "resume" the application, this can be achieved in a matter of seconds or
> less. Then, we can mount the snapshot at a separate mount point and
> proceed with a more intensive backup process against a particular
> consistent point-in-time. This can be fast and require closer to 1X the
> storage overhead, instead of 2X the storage overhead.
>
This is exactly my main use case.
>
>
> The WARNING is a cover-your-ass type warning that is showing up
> inappropriately for us. It is warning me something that I should already
> know, and it is training me to ignore warnings. Thinp doesn't have to be
> the answer to everything. It does, however, need to provide a block
> device visible to the file system layer, and it isn't invalid for the
> file system layer to be able to query about the nature of the block
> device, such as "how much space do you *really* have left?"
As this warning appears on snapshots, it is quite annoying in fact. On
the other hand, I fully understand that the developers want to avoid
"blind" overprovisioning. A command-line (or lvm.conf) option to
override the warning would be welcome, though.
> This seems to be a crux of this debate between you and the other people.
> You think the block storage should be as transparent as possible, as if
> the storage was not thin. Others, including me, think that this theory
> is impractical, as it leads to edge cases where the file system could
> choose to fail in a cleaner way, but it gets too far today leading to a
> more dangerous failure when it allocates some block, but not some other
> block.
>
> ...
>
>
> It is your opinion that extending thin volumes to allow the file system
> to have more information is breaking some fundamental law. But, in
> practice, this sort of thing is done all of the time. "Size", "Read
> only", "Discard/Trim Support", "Physical vs Logical Sector Size", ...
> are all information queried from the device, and used by the file
> system. If it is a general concept that applies to many different device
> targets, and it will help the file system make better and smarter
> choices, why *shouldn't* it be communicated? Who decides which ones are
> valid and which ones are not?
This seems reasonable. After all, a simple "lsblk" already reports
plenty of information to the upper layer, so adding a
"REAL_AVAILABLE_SPACE" info should not be infeasible.
>
> I didn't disagree with all of your points. But, enough of them seemed to
> be directly contradicting my perspective on the matter that I felt it
> important to respond to them.
>
Thinp really is a wonderful piece of technology, and I really thank the
developers for it.
>
>
> --
> Mark Mielke <mark.mielke@gmail.com <mailto:mark.mielke@gmail.com>>
>
>
>
> _______________________________________________
> linux-lvm mailing list
> linux-lvm@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-lvm
> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
>
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: [linux-lvm] thin handling of available space
2016-05-02 14:32 ` Mark Mielke
@ 2016-05-03 9:45 ` Zdenek Kabelac
2016-05-03 10:41 ` Mark Mielke
2016-05-03 10:15 ` Gionatan Danti
1 sibling, 1 reply; 29+ messages in thread
From: Zdenek Kabelac @ 2016-05-03 9:45 UTC (permalink / raw)
To: LVM general discussion and development, mark.mielke
On 2.5.2016 16:32, Mark Mielke wrote:
>
> On Fri, Apr 29, 2016 at 7:23 AM, Zdenek Kabelac <zkabelac@redhat.com
> <mailto:zkabelac@redhat.com>> wrote:
>
> Thin-provisioning is NOT about providing device to the upper
> system levels and inform THEM about this lie in-progress.
> That's complete misunderstanding of the purpose.
>
>
> I think this line of thought is a bit of a strawman.
>
> Thin provisioning is entirely about presenting the upper layer with a logical
> view which does not match the physical view, including the possibility for
> such things as over provisioning. How much of this detail is presented to the
> higher layer is an implementation detail and has nothing to do with "purpose".
> The purpose or objective is to allow volumes that are not fully allocated in
> advance. This is what "thin" means, as compared to "thick".
>
> If you seek for a filesystem with over-provisioning - look at btrfs, zfs
> and other variants...
>
>
> I have to say that I am disappointed with this view, particularly if this is a
> view held by Red Hat. To me this represents a misunderstanding of the purpose
Hi
So first - this is an AMAZING deduction you've just shown.
You've cut a sentence out of the middle of a thread and used it as a kind of
evidence that Red Hat is suggesting the usage of ZFS or Btrfs - sorry man -
read this thread again...
Personally, I'd never use those two filesystems, as they are too complex for
recovery. But I've no problem advising users to try them if that's what fits
their needs best and they believe in the 'all-in-one' logic.
('Hitting the wall' is the best learning exercise in Xen's case anyway...)
> When a storage provider provides a block device (EMC, NetApp, ...) and a
> snapshot capability, I expect to be able to take snapshots with low overhead.
> The previous LVM model for snapshots was really bad, in that it was not low
> overhead. We use this capability for many purposes including:
This usage is perfectly fine. It's been designed this way from day 1.
> 1) Instantiating test environments or dev environments from a snapshot of
> production, with copy-on-write to allow for very large full-scale environments
> to be constructed quickly and with low overhead. In one of our examples, this
> includes an example where we have about 1 TByte of JIRA and Confluence
> attachments collected over several years. It is exposed over NFS by the NetApp
> device, but in the backend it is a volume. This volume is snapshot and then
> exposed as a different volume with copy-on-write characteristics. The storage
> allocation is monitored, and if it is exceeded, it is known that there will be
> particular behaviour. I believe in our case, the behaviour is that the
> snapshot becomes unusable.
A thin pool does not distinguish between a snapshot and its origin.
All thin volumes share the same pool space.
It's up to the monitoring application to decide whether some snapshots could be
erased to reclaim space in the thin-pool.
The recent tool thin_ls shows how much data is held exclusively by
individual thin volumes.
That's a major difference compared with old snapshots and their 'invalidation' logic.
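The monitoring policy described here can be sketched in a few lines: feed in the per-volume exclusive usage (the kind of figure `thin_ls` reports) and drop the oldest snapshots until enough pool space would be reclaimed. Volume names, timestamps, and thresholds below are hypothetical:

```python
def snapshots_to_drop(snapshots, needed_bytes):
    """snapshots: list of (name, created_ts, exclusive_bytes), as a monitor
    might assemble from thin_ls output.
    Returns the oldest snapshots whose exclusively-held data, once deleted,
    frees at least needed_bytes (or all of them if that is not enough)."""
    reclaimed, victims = 0, []
    for name, _ts, excl in sorted(snapshots, key=lambda s: s[1]):
        if reclaimed >= needed_bytes:
            break
        victims.append(name)
        reclaimed += excl
    return victims

pool = [("snap-hourly", 1000, 50 << 20),
        ("snap-daily",   900, 300 << 20),
        ("snap-15min",  1100, 10 << 20)]
print(snapshots_to_drop(pool, 320 << 20))
```

Because all thin volumes share one pool, only the *exclusive* bytes of a snapshot are actually returned when it is removed - which is precisely why thin_ls reports that figure rather than the volume's apparent size.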
>
> 2) Frequent snapshots. In many of our use cases, we may take snapshots every
> 15 minutes, every hour, and every day, keeping 3 or more of each. If this
> storage had to be allocated in full, this amounts to at least 10X the storage
> cost. Using snapshots, and understanding the rate of churn, we can use closer
> to 1X or 2X the storage overhead, instead of 10X the storage overhead.
Sure - snapper... whatever you name it.
It's just up to the admin to maintain space availability in the thin-pool.
> 3) Snapshot as a means of achieving a consistent backup at low cost of outage
> or storage overhead. If we "quiesce" the application (flush buffers, put new
> requests on hold, etc.) take the snapshot, and then "resume" the application,
> this can be achieved in a matter of seconds or less. Then, we can mount the
> snapshot at a separate mount point and proceed with a more intensive backup
> process against a particular consistent point-in-time. This can be fast and
> require closer to 1X the storage overhead, instead of 2X the storage overhead.
>
> In all of these cases - we'll buy more storage if we need more storage. But,
> we're not going to use BTRFS or ZFS to provide the above capabilities, just
And where exactly did I advise you specifically to switch to those filesystems?
My advice is clearly given to a user who seeks a filesystem COMBINED with a
block layer.
> because this is your opinion on the matter. Storage vendors of reputation and
> market presence sell these capabilities as features, and we pay a lot of money
> to have access to these features.
>
> In the case of LVM... which is really the point of this discussion... LVM is
> not necessarily going to be used or available on a storage appliance. The LVM
> use case, at least for us, is for storage which is thinly provisioned by the
> compute host instead of the backend storage appliance. This includes:
>
> 1) Local disks, particularly local flash drives, used to achieve higher
> levels of performance than can normally be achieved with a remote storage
> appliance.
>
> 2) Local file systems, on remote storage appliances, using a protocol such as
> iSCSI to access the backend block device. This might be the case where we need
> better control of the snapshot process, or to abstract the management of the
> snapshots from the backend block device. In our case, we previously used an EMC
> over iSCSI for one of these use cases, and we are switching to NetApp.
> However, instead of embedding NetApp-specific logic into our code, we want to
> use LVM on top of iSCSI, and re-use the LVM thin pool capabilities from the
> host, such that we don't care what storage is used on the backend. The
> management scripts will work the same whether the storage is local (the first
> case above) or not (the case we are looking into now).
>
> In both of these cases, we have a need to take snapshots and manage them
> locally on the host, instead of managing them on a storage appliance. In both
> cases, we want to take many light weight snapshots of the block device. You
> could argue that we should use BTRFS or ZFS, but you should full well know
> that both of these have caveats as well. We want to use XFS or EXT4 as our
> needs require, and still have the ability to take light-weight snapshots.
Which is exactly the actual Red Hat strategy. XFS is strongly pushed forward.
> Generally, I've seen the people who argue that thin provisioning is a "lie",
> tend to not be talking about snapshots. I have a sense that you are talking
> more as storage providers for customers, and talking more about thinly
> provisioning content for your customers. In this case - I think I would agree
> that it is a "lie" if you don't make sure to have the storage by the time it
Thin-provisioning simply requires RESPONSIBLE admins - if you are not willing
to take care of your thin-pools, don't use them - lots of kittens may die.
And that's all this thread was about - it had absolutely nothing to do with Red
Hat or any of your conspiracy theories, like it pushing you to switch to a
filesystem you don't like...
> Device target is definitely not here to solve filesystem troubles.
> Thinp is about 'promising' - you as admin promised you will provide
> space - we could here discuss maybe that LVM may possibly maintain
> max growth size we can promise to user - meanwhile - it's still the admin
> who creates thin-volume and gets WARNING if VG is not big enough when all
> thin volumes would be fully provisioned.
> And THAT'S IT - nothing more.
> So please avoid making thinp target to be answer to ultimate question of
> life, the universe, and everything - as we all know it's 42...
>
>
> The WARNING is a cover-your-ass type warning that is showing up
> inappropriately for us. It is warning me something that I should already know,
> and it is training me to ignore warnings. Thinp doesn't have to be the answer
> to everything. It does, however, need to provide a block device visible to the
> file system layer, and it isn't invalid for the file system layer to be able
> to query about the nature of the block device, such as "how much space do you
> *really* have left?"
This is not so useful information - as this state is dynamic.
The only 'valid' query is - are we out of space?
And that's what you get from the block layer now - ENOSPC.
Filesystems may then react differently to ENOSPC than to a plain EIO.
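The "are we out of space" signal is simply the errno on the failing write. On Linux, `/dev/full` is a convenient stand-in for an exhausted device, so the reaction to ENOSPC versus other errors can be sketched as (assuming a Linux system):

```python
import errno

def try_write(path, data):
    """Return 'ok' or 'out-of-space'; re-raise any other I/O error."""
    try:
        with open(path, "wb", buffering=0) as f:
            f.write(data)
        return "ok"
    except OSError as e:
        if e.errno == errno.ENOSPC:
            return "out-of-space"   # the only 'valid' answer the pool gives
        raise

print(try_write("/dev/full", b"x"))   # /dev/full always reports ENOSPC
print(try_write("/dev/null", b"x"))   # /dev/null always accepts the write
```

This is the whole interface as it stands today: no advance notice of how much the pool has left, only success or ENOSPC at the moment of allocation.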
I'd be really curious what the use case for this information would even be.
If you care about e.g. 'df' - then let's fix 'df' - it may check whether the fs
is on a thinly provisioned volume, ask the provisioner about free space in the
pool, and combine the results in some way...
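A "fixed df" along these lines would report the minimum of what the filesystem thinks it has free and what the pool can still deliver - under the simplifying (and, with multiple filesystems in one pool, optimistic) assumption that this filesystem could claim all remaining pool space. A sketch with hypothetical numbers:

```python
def effective_free(fs_free_bytes, pool_free_bytes):
    """What a thin-aware 'df' might report: the filesystem cannot use
    more than it has free, nor more than the pool can still allocate."""
    return min(fs_free_bytes, pool_free_bytes)

# Hypothetical numbers: the fs believes it has 80 GiB free, but the
# shared thin-pool only has 20 GiB of unallocated chunks left.
GiB = 1 << 30
print(effective_free(80 * GiB, 20 * GiB) // GiB)   # -> 20
```

The sketch also makes the earlier objection concrete: with two filesystems in one pool, both would report the same 20 GiB, and neither number is a guarantee.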
Just DO NOT mix this into the filesystem layer...
What would the filesystem do with this info?
Should it randomly decide to drop files according to the thin-pool workload?
Would you change every filesystem in the kernel to implement such policies?
It's really the thin-pool monitoring that tries to add space when the pool is
getting low, and it may implement further policies, e.g. dropping some snapshots.
However, what is being implemented now is better 'allocation' logic for pool
chunk provisioning (for XFS at the moment) - as the rather 'dated' methods for
deciding where to store incoming data do not work efficiently with provisioned
chunks.
> This seems to be a crux of this debate between you and the other people. You
> think the block storage should be as transparent as possible, as if the
> storage was not thin. Others, including me, think that this theory is
> impractical, as it leads to edge cases where the file system could choose to
It's purely practical, and it's the 'crucial' difference between
e.g. thin+XFS/ext4 and BTRFS.
> fail in a cleaner way, but it gets too far today leading to a more dangerous
> failure when it allocates some block, but not some other block.
The best thing to do is to stop immediately on error and make the fs
'read-only' - which is exactly what 'ext4 + errors=remount-ro' does.
Your proposal to make XFS into a different kind of BTRFS monster is simply not
going to work - that's exactly what BTRFS is doing - it would be a waste of
time to do it again.
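The ext4 behaviour referred to here is an existing mount option; a typical /etc/fstab entry (device path and mount point are hypothetical) would be:

```
# Remount read-only the moment ext4 hits a metadata error (e.g. the
# thin-pool failing an allocation underneath it):
/dev/vg0/thinlv  /srv/data  ext4  defaults,errors=remount-ro  0  2
```

The same policy can be set persistently in the superblock with `tune2fs -e remount-ro` instead of per-mount.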
BTRFS has a built-in volume manager and combines the fs layer with the block
layer (making many layers in the kernel quite ugly - e.g. device major:minor
numbers).
lvm2 takes a different approach - layers are separated, with clearly
defined logic.
So again - if you don't like a separate thin block layer + XFS fs layer and you
want to see 'merged' technology - there are BTRFS/ZFS/..., which try to
combine raid/caching/encryption/snapshots... - but there are no plans to
'reinvent' the same from the other side with lvm2/dm.
> Exaggerating this to say that thinp would become everything, and the answer to
> the ultimate question of life, weakens your point to me, as it means that you
> are seeing things in far too black + white, whereas real life is often not
> black + white.
Yes, we prefer clearly defined borders and responsibilities, which can be well
tested and verified.
Don't compare life with software :)
>
> It is your opinion that extending thin volumes to allow the file system to
> have more information is breaking some fundamental law. But, in practice, this
> sort of thing is done all of the time. "Size", "Read only", "Discard/Trim
> Support", "Physical vs Logical Sector Size", ... are all information queried
> from the device, and used by the file system. If it is a general concept that
> applies to many different device targets, and it will help the file system
> make better and smarter choices, why *shouldn't* it be communicated? Who
> decides which ones are valid and which ones are not?
lvm2 is a logical volume manager. Just think about it.
In the future your thinLV might be turned into a plain 'linear' LV, just as your
linearLV might become a member of a thin-pool (planned features).
Your LV could be pvmove(d) to a completely different drive with different
geometry...
These are topics for lvm2/dm.
We are not designing a filesystem - and we plan to stay transparent to them.
And it's up to you to understand the reasoning.
> I didn't disagree with all of your points. But, enough of them seemed to be
> directly contradicting my perspective on the matter that I felt it important
> to respond to them.
It is an Open Source world - "so send a patch" and implement your visions -
again, it is that easy - we do it every day at Red Hat...
> Mostly, I think everybody has a set of opinions and use cases in mind when
> they come to their conclusions. Please don't ignore mine. If there is
> something unreasonable above, please let me know.
It's not about ignoring - it's about having a certain amount of man-hours for
the work, and you have to choose how to 'spend' them.
And in this case, for your ideas, you will need to spend/invest your own time...
(Just like Xen).
Regards
Zdenek
* Re: [linux-lvm] thin handling of available space
2016-04-29 11:23 ` Zdenek Kabelac
@ 2016-05-02 14:32 ` Mark Mielke
2016-05-03 9:45 ` Zdenek Kabelac
2016-05-03 10:15 ` Gionatan Danti
2016-05-03 12:42 ` Xen
1 sibling, 2 replies; 29+ messages in thread
From: Mark Mielke @ 2016-05-02 14:32 UTC (permalink / raw)
To: LVM general discussion and development
On Fri, Apr 29, 2016 at 7:23 AM, Zdenek Kabelac <zkabelac@redhat.com> wrote:
> Thin-provisioning is NOT about providing device to the upper
> system levels and inform THEM about this lie in-progress.
> That's complete misunderstanding of the purpose.
>
I think this line of thought is a bit of a strawman.
Thin provisioning is entirely about presenting the upper layer with a
logical view which does not match the physical view, including the
possibility for such things as over provisioning. How much of this detail
is presented to the higher layer is an implementation detail and has
nothing to do with "purpose". The purpose or objective is to allow volumes
that are not fully allocated in advance. This is what "thin" means, as
compared to "thick".
> If you seek for a filesystem with over-provisioning - look at btrfs, zfs
> and other variants...
>
I have to say that I am disappointed with this view, particularly if this
is a view held by Red Hat. To me this represents a misunderstanding of the
purpose for over-provisioning, and a misunderstanding of why thin volumes
are required. It seems there is a focus on "filesystem" in the above
statement, and that this may be the point of debate.
When a storage provider provides a block device (EMC, NetApp, ...) and a
snapshot capability, I expect to be able to take snapshots with low
overhead. The previous LVM model for snapshots was really bad, in that it
was not low overhead. We use this capability for many purposes including:
1) Instantiating test environments or dev environments from a snapshot of
production, with copy-on-write to allow for very large full-scale
environments to be constructed quickly and with low overhead. In one of our
examples, this includes an example where we have about 1 TByte of JIRA and
Confluence attachments collected over several years. It is exposed over NFS
by the NetApp device, but in the backend it is a volume. This volume is
snapshot and then exposed as a different volume with copy-on-write
characteristics. The storage allocation is monitored, and if it is
exceeded, it is known that there will be particular behaviour. I believe in
our case, the behaviour is that the snapshot becomes unusable.
2) Frequent snapshots. In many of our use cases, we may take snapshots
every 15 minutes, every hour, and every day, keeping 3 or more of each. If
this storage had to be allocated in full, this amounts to at least 10X the
storage cost. Using snapshots, and understanding the rate of churn, we can
use closer to 1X or 2X the storage overhead, instead of 10X the storage
overhead.
3) Snapshot as a means of achieving a consistent backup at low cost of
outage or storage overhead. If we "quiesce" the application (flush buffers,
put new requests on hold, etc.) take the snapshot, and then "resume" the
application, this can be achieved in a matter of seconds or less. Then, we
can mount the snapshot at a separate mount point and proceed with a more
intensive backup process against a particular consistent point-in-time.
This can be fast and require closer to 1X the storage overhead, instead of
2X the storage overhead.
In all of these cases - we'll buy more storage if we need more storage.
But, we're not going to use BTRFS or ZFS to provide the above capabilities,
just because this is your opinion on the matter. Storage vendors of
reputation and market presence sell these capabilities as features, and we
pay a lot of money to have access to these features.
In the case of LVM... which is really the point of this discussion... LVM
is not necessarily going to be used or available on a storage appliance.
The LVM use case, at least for us, is for storage which is thinly
provisioned by the compute host instead of the backend storage appliance.
This includes:
1) Local disks, particularly local flash drives, used to achieve higher levels
of performance than can normally be achieved with a remote storage appliance.
2) Local file systems, on remote storage appliances, using a protocol such
as iSCSI to access the backend block device. This might be the case where
we need better control of the snapshot process, or to abstract the
management of the snapshots from the backend block device. In our case, we
previously used an EMC over iSCSI for one of these use cases, and we are
switching to NetApp. However, instead of embedding NetApp-specific logic
into our code, we want to use LVM on top of iSCSI, and re-use the LVM thin
pool capabilities from the host, such that we don't care what storage is
used on the backend. The management scripts will work the same whether the
storage is local (the first case above) or not (the case we are looking
into now).
In both of these cases, we have a need to take snapshots and manage them
locally on the host, instead of managing them on a storage appliance. In
both cases, we want to take many lightweight snapshots of the block
device. You could argue that we should use BTRFS or ZFS, but you know full
well that both of these have caveats as well. We want to use XFS or EXT4 as
our needs require, and still have the ability to take lightweight snapshots.
Generally, I've seen the people who argue that thin provisioning is a
"lie", tend to not be talking about snapshots. I have a sense that you are
talking more as storage providers for customers, and talking more about
thinly provisioning content for your customers. In this case - I think I
would agree that it is a "lie" if you don't make sure to have the storage
by the time it is required. But, I think this is a very small use case in
reality. I think large service providers would use Ceph or EMC or NetApp,
or some such technology to provision large amounts of storage per customer,
and LVM would be used more at the level of a single customer, or a single
machine. In these cases, I would expect that LVM thin volumes should not be
used across multiple customers without understanding the exact type of
churn expected, and therefore the maximum allocation that would be
required. In the case of our IT team and EMC or NetApp, they mostly avoid
the use of thin volumes for "cross customer" purposes, and instead use thin
volumes for a specific customer, for a specific need. In the case of Amazon
EC2, for example... I would use EBS for storage, and expect that even if it
is "thin", Amazon would make sure to have enough storage to meet my
requirement if I need them. But, I would use LVM on my Amazon EC2 instance,
and I would expect to be able to use LVM thin pool snapshots to over
provision my own per-machine storage requirements by creating multiple
snapshots of the underlying storage, with a full understanding of the
amount of churn that I expect to occur, and a full understanding of the
need to monitor.
> Device target is definitely not here to solve filesystem troubles.
> Thinp is about 'promising' - you as admin promised you will provide
> space - we could here discuss maybe that LVM may possibly maintain
> max growth size we can promise to user - meanwhile - it's still the admin
> who creates thin-volume and gets WARNING if VG is not big enough when all
> thin volumes would be fully provisioned.
> And THAT'S IT - nothing more.
> So please avoid making thinp target to be answer to ultimate question of
> life, the universe, and everything - as we all know it's 42...
The WARNING is a cover-your-ass type warning that is showing up
inappropriately for us. It warns me about something I should already
know, and it is training me to ignore warnings. Thinp doesn't have to be
the answer to everything. It does, however, need to provide a block device
visible to the file system layer, and it isn't invalid for the file system
layer to be able to query about the nature of the block device, such as
"how much space do you *really* have left?"
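For what it's worth, that question can already be answered out-of-band today
by parsing the pool's `dmsetup status` line (field layout per the kernel's
thin-provisioning documentation); a sketch, with the pool device name being
an assumption:

```shell
#!/bin/sh
# Sketch: answer "how much space does the pool *really* have left?"
# from userspace.  The pool name (vg-pool0-tpool) is an assumption.

# A thin-pool status line looks like:
#   <start> <len> thin-pool <txid> <used_meta>/<total_meta> <used_data>/<total_data> ...

pool_data_pct() {  # parse one status line, print % of data blocks used
    echo "$1" | awk '{ split($6, d, "/"); printf "%d\n", d[1] * 100 / d[2] }'
}

# Real query (commented out -- needs root and an actual pool):
# pool_data_pct "$(dmsetup status vg-pool0-tpool)"

# Demo on a canned status line (8192 of 16384 data blocks used):
pool_data_pct "0 2097152 thin-pool 5 1/4096 8192/16384 - rw no_discard_passdown queue_if_no_space -"
```

The disagreement is not whether this number is obtainable, but whether the
filesystem itself should be allowed to ask for it through the block layer.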
This seems to be a crux of this debate between you and the other people.
You think the block storage should be as transparent as possible, as if the
storage was not thin. Others, including me, think that this theory is
impractical, as it leads to edge cases where the file system could choose
to fail in a cleaner way, but instead goes too far, leading to a more
dangerous failure when it can allocate some blocks but not others.
Exaggerating this to say that thinp would become everything, and the answer
to the ultimate question of life, weakens your point to me, as it means
that you are seeing things in far too black + white, whereas real life is
often not black + white.
It is your opinion that extending thin volumes to allow the file system to
have more information is breaking some fundamental law. But, in practice,
this sort of thing is done all of the time. "Size", "Read only",
"Discard/Trim Support", "Physical vs Logical Sector Size", ... are all
information queried from the device, and used by the file system. If it is
a general concept that applies to many different device targets, and it
will help the file system make better and smarter choices, why *shouldn't*
it be communicated? Who decides which ones are valid and which ones are not?
I didn't disagree with all of your points. But, enough of them seemed to be
directly contradicting my perspective on the matter that I felt it
important to respond to them.
Mostly, I think everybody has a set of opinions and use cases in mind when
they come to their conclusions. Please don't ignore mine. If there is
something unreasonable above, please let me know.
--
Mark Mielke <mark.mielke@gmail.com>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [linux-lvm] thin handling of available space
2016-04-28 18:25 ` Xen
@ 2016-04-29 11:23 ` Zdenek Kabelac
2016-05-02 14:32 ` Mark Mielke
2016-05-03 12:42 ` Xen
0 siblings, 2 replies; 29+ messages in thread
From: Zdenek Kabelac @ 2016-04-29 11:23 UTC (permalink / raw)
To: linux-lvm
On 28.4.2016 20:25, Xen wrote:
> Continuing from previous mail I guess. But I realized something.
>
>> A responsible sysadmin who chose to use thin pools might configure the
>> initial FS size to be some modest size well within the constraints of
>> the actual block store, and then as the FS hit say 85% utilization to
>> run a script that investigated the state of the block layer and use
>> resize2fs and friends to grow the FS and let the thin-pool likewise grow
>> to fit as IO gets issued. But at some point when the competing demands
>> of other FS on thin-pool were set to breach actual block availability
>> the FS growth would be denied and thus userland would get signaled by
>> the FS layer that it's out of space when it hit 100% util.
>
> Well of course what you describe here are increasingly complex strategies
>> that require development and should not be put on individual administrators
> (or even organisations) to devise and come up with.
>
> Growing filesystems? If you have a platform where continuous thin pool
> growth is possible (and we are talking of well developed, complex setups
> here) then maybe you have in-house tools to take care of all of that.
>
> So you suggest a strategy here that involves both intelligent automatic
> administration of the FS layer as well as the block layer.
>
> A concerted strategy where for example you do have a defined thin volume
> size but you constrain your FS artificially AND depend its intelligence on
> knowledge of your thin pool size. And then you have created an
> intelligence where the "filesystem agent" can request growth, and perhaps
> the "block level agent" may grant or deny it such that FS growth is staged
> and given hard limits at every point. And then you have the same
> functionality as what I described other than that it is more sanely
> constructed at intervals.
>
> No continuous updating, but staged growth intervals or moments.
I'm not going to add much to this thread - since there is nothing really
useful for development here. But let me highlight a few important points:
Thin provisioning is NOT about providing a device to the upper
system layers and informing THEM about the lie in progress.
That's a complete misunderstanding of the purpose.
If you are looking for a filesystem with over-provisioning - look at btrfs,
zfs, and other variants...
Device target is definitely not here to solve filesystem troubles.
Thinp is about 'promising' - you as admin promised you will provide
space - we could here discuss maybe that LVM may possibly maintain
max growth size we can promise to user - meanwhile - it's still the admin
who creates thin-volume and gets WARNING if VG is not big enough when all thin
volumes would be fully provisioned.
And THAT'S IT - nothing more.
So please avoid making the thinp target the answer to the ultimate question
of life, the universe, and everything - as we all know it's 42...
>
>> But either way if you have a sudden burst of I/O from competing
>> interests in the thin-pool, what appeared to be a safe growth allocation
>> at one instant of time is not likely to be true when actual writes try
>> to get fulfilled.
>
> So in the end monitoring is important but because you use a thin pool
> there are like 3 classes of situations that change:
>
> * Filesystems will generally have more leeway because you are /able/ to
> provide them with more (virtual) space to begin with, in the assumption
> that you won't readily need it, but it's normally going to be there when
> it does.
So you try to design 'another btrfs' on top of thin provisioning?
> * Thin volumes do allow you to make better use of the available space (as
> per btrfs, I guess) and give many advantages in moving data around.
With 'thinp' you want the simplest filesystem with robust metadata - so in
theory, 'ext4' or XFS without all the 'improvements' for rotational HDDs
that have accumulated over decades of their evolution.
> 1. Unless you monitor it directly in some way, the lack of information is
> going to make you feel rather annoyed and insecure
>
> 2. Normally user tools do inform you of system status (a user-run "ls" or
> "df" is enough) but you cannot have lvs information unless run as root.
You miss the 'key' details.
A thin pool is not constructing 'free-maps' for each LV all the time -
that's why tools like 'thin_ls' are meant to be used from user space.
It IS a very EXPENSIVE operation.
So before you start to present your visions here, please spend some time
reading the docs and understanding all the technology behind it.
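For readers following the thin_ls reference: the documented way to inspect a
live pool is to reserve a metadata snapshot first, then point thin_ls at the
pool's metadata device. A sketch only - the device names are assumptions, and
the exact field names should be checked against your thin-provisioning-tools
version (`thin_ls --help`):

```shell
#!/bin/sh
# Sketch: per-thin-LV block usage via thin_ls, without pausing the pool.
# Device names below are assumptions; the real commands need root.

# 1. Reserve a metadata snapshot so thin_ls sees consistent metadata:
# dmsetup message vg-pool0-tpool 0 reserve_metadata_snap
# 2. List per-device usage (field names from memory -- check thin_ls --help):
# thin_ls --metadata-snap --format DEV,MAPPED_BLOCKS,EXCLUSIVE_BLOCKS \
#         /dev/mapper/vg-pool0_tmeta
# 3. Release the snapshot when done:
# dmsetup message vg-pool0-tpool 0 release_metadata_snap

sum_mapped() {  # total the 2nd (MAPPED) column of thin_ls-style output
    awk 'NR > 1 { total += $2 } END { print total }'
}

# Demo on canned output, since we cannot assume a live pool here:
printf 'DEV MAPPED_BLOCKS EXCLUSIVE_BLOCKS\n1 1024 512\n2 2048 2048\n' | sum_mapped
```

The reserve/release dance is exactly why this is an expensive, explicitly
userspace operation rather than something the pool tracks continuously.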
> Even with a perfect LVM monitoring tool, I would experience a consistent
> lack of feedback.
That is a mistake in your expectations.
If you are trying to operate a thin-pool near 100% fullness - you will need
to write and design a completely different piece of software - sorry, thinp
is not for you and never will be...
Simply use 'fully' provisioned - aka - already existing standard volumes.
>
> Just a simple example: I can adjust "df" to do different stuff. But any
> program reporting free diskspace is going to "lie" to me in that sense. So
> yes I've chosen to use thin LVM because it is the best solution for me
> right now.
'df' has nothing in common with the 'block' layer.
> Technically I consider autoextend not that great of a solution either.
>
> It begs the question: why did you not start out with a larger volume in
> the first place? You going to keep adding disks as the thing grows?
Very simple answer, related to the same misunderstanding of the purpose.
Take it as motivation: you want to reduce the number of active devices in,
e.g., your datacenter.
So you start with a 1TB volume - while the user may immediately create,
format, and use e.g. a 10TB volume. As the volume fills over time - you add
more devices to your VG (buy/pay for more disk space/energy).
But the user doesn't have to resize his filesystem or bear other costs of
maintaining a slowly growing filesystem.
Of course, if the first thing the user does is e.g. 'dd' the full 10TB
volume, there are not going to be any savings!
But if you never planned to buy 10TB - you should never have allowed such a
big volume to be created in the first place!
With thinp you basically postpone or skip (fsresize) some operations.
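(For reference, this postponed growth can even be automated: dmeventd
monitors the pool, driven by two lvm.conf knobs. The values below are
illustrative - check your own lvm.conf for the shipped defaults:)

```
# /etc/lvm/lvm.conf -- activation section.  Illustrative values; a
# threshold of 100 disables autoextension entirely.
activation {
    thin_pool_autoextend_threshold = 80   # act once pool data usage hits 80%
    thin_pool_autoextend_percent   = 20   # then grow the pool by 20%
}
```

dmeventd still needs free extents in the VG to act on this, which is exactly
the "admin promise" being discussed.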
> An overprovisioned system with individual volumes that individually cannot
> reach their max size is a bad system.
Yes - it is bad system.
So don't do it - and don't plan to use it - it's really that simple.
ThinP is NOT virtual disk-space for free...
> Thin pools lie. Yes. But it's not a lie if the space is available. It's
> only a lie if the space is no longer available!
>
> It is not designed to lie.
Actually it's the core principle!
It lies (or, better said, relies on the admin's promise) that there is going
to be disk space. And it's the admin's responsibility to fulfill it.
If you know up front that you will quickly need all the disk space - then
using thinp and expecting a miracle is not going to work.
Regards
Zdenek
* Re: [linux-lvm] thin handling of available space
2016-04-28 10:43 ` matthew patton
2016-04-28 18:20 ` Xen
@ 2016-04-28 18:25 ` Xen
2016-04-29 11:23 ` Zdenek Kabelac
1 sibling, 1 reply; 29+ messages in thread
From: Xen @ 2016-04-28 18:25 UTC (permalink / raw)
To: LVM general discussion and development
Continuing from previous mail I guess. But I realized something.
> A responsible sysadmin who chose to use thin pools might configure the
> initial FS size to be some modest size well within the constraints of
> the actual block store, and then as the FS hit say 85% utilization to
> run a script that investigated the state of the block layer and use
> resize2fs and friends to grow the FS and let the thin-pool likewise
> grow
> to fit as IO gets issued. But at some point when the competing demands
> of other FS on thin-pool were set to breach actual block availability
> the FS growth would be denied and thus userland would get signaled by
> the FS layer that it's out of space when it hit 100% util.
Well of course what you describe here are increasingly complex strategies
that require development and should not be put on individual administrators
(or even organisations) to devise and come up with.
Growing filesystems? If you have a platform where continuous thin pool
growth is possible (and we are talking of well developed, complex setups
here) then maybe you have in-house tools to take care of all of that.
So you suggest a strategy here that involves both intelligent automatic
administration of the FS layer as well as the block layer.
A concerted strategy where for example you do have a defined thin volume
size but you constrain your FS artificially AND depend its intelligence on
knowledge of your thin pool size. And then you have created an intelligence
where the "filesystem agent" can request growth, and perhaps the "block
level agent" may grant or deny it such that FS growth is staged and given
hard limits at every point. And then you have the same functionality as
what I described other than that it is more sanely constructed at
intervals.
No continuous updating, but staged growth intervals or moments.
> But either way if you have a sudden burst of I/O from competing
> interests in the thin-pool, what appeared to be a safe growth allocation
> at one instant of time is not likely to be true when actual writes try
> to get fulfilled.
So in the end monitoring is important but because you use a thin pool
there are like 3 classes of situations that change:
* Filesystems will generally have more leeway because you are /able/ to
provide them with more (virtual) space to begin with, in the assumption
that you won't readily need it, but it's normally going to be there when
it does.
* Hard limits in the filesystem itself are still a use case that has no
good solution; most applications will start crashing or behaving weirdly
when out of diskspace. Freezing a filesystem (when it is not a system disk)
might be as good a mitigation strategy as anything that involves "oh no, I
am out of diskspace and now I am going to ensure endless trouble as
processes keep trying to write to that empty space - that nonexistent
space". If anything I don't think most systems gracefully recover from
that.
Creating temporary filesystems for important parts is not all that bad.
* Thin volumes do allow you to make better use of the available space (as
per btrfs, I guess) and give many advantages in moving data around.
The only detriment really to thin for a desktop power user, so to speak, is:
1. Unless you monitor it directly in some way, the lack of information is
going to make you feel rather annoyed and insecure
2. Normally user tools do inform you of system status (a user-run "ls" or
"df" is enough) but you cannot have lvs information unless run as root.
The system-config-lvm tool just runs as setuid. I can add volumes without
authenticating as root.
Regular command line tools are not accessible to the user.
So what I have been suggesting obviously seeks to address point 2. I am
more than willing to address point 1 by developing something, but I'm not
sure I will ever be able to develop again in this bleak sense of decay I am
experiencing life to be currently ;-).
Anyhow, it would never fully satisfy for me.
Even with a perfect LVM monitoring tool, I would experience a consistent
lack of feedback.
Just a simple example: I can adjust "df" to do different stuff. But any
program reporting free diskspace is going to "lie" to me in that sense. So
yes I've chosen to use thin LVM because it is the best solution for me
right now.
At the same time indeed, I lack information and this information cannot be
sourced directly from the block layer because that's not how computer
software works. Computer software doesn't interface with the block layer.
It interfaces with filesystems and reports information from there.
Technically I consider autoextend not that great of a solution either.
It begs the question: why did you not start out with a larger volume in
the first place? You going to keep adding disks as the thing grows?
I mean, I don't know. If I'm some VPS user running on a thinly-provisioned
host, maybe it's nice to be oblivious. But unless my host has a perfect
failsafe setup, the only time I am going to be notified of failure is if my
volume (that I don't know about) drops or freezes.
Would I personally like having a tool that would show at some point
something going wrong at the lower level? I think I would.
An overprovisioned system with individual volumes that individually cannot
reach their max size is a bad system.
That they can't all do it at the same time is not that much of a problem.
That is not very important.
Yet considering a different situation -- suppose this is a host with few
clients but high data requirements. Suppose there are only 4 thin volumes.
And suppose every thin volume is going to be something of 2TB or make it
anything as large as you want.
(I just have 50GB on my vps). Suppose you had a 6TB disk and you
provisioned it for 4 clients x 2TB. Economies of scale only start to
really show their benefit with much higher number of clients. With 200
clients the "averaging" starts to work in your favour giving you a
dependable system that is not going to suddenly do something weird.
But with smaller numbers you do run into the risk of something going amiss.
The only reason lack of feedback would not be important for your clients is
if you had a large enough pool, and individual volumes would be just a
small part of that pool, say 50-100 volumes per pool.
So I guess I'm suggesting there may be a use case for thin LVM in which
you do not have this >10 number of volumes sitting in any pool.
And at that point personally even if I'm the client of that system, I do
want to be informed.
And I would prefer to be informed *through* the pipe that already exists.
Thin pools lie. Yes. But it's not a lie if the space is available. It's
only a lie if the space is no longer available!
It is not designed to lie.
* Re: [linux-lvm] thin handling of available space
2016-04-28 10:43 ` matthew patton
@ 2016-04-28 18:20 ` Xen
2016-04-28 18:25 ` Xen
1 sibling, 0 replies; 29+ messages in thread
From: Xen @ 2016-04-28 18:20 UTC (permalink / raw)
To: LVM general discussion and development
Let me just write down some thoughts here.
First of all you say that fundamental OS design is about higher layers
trusting lower layers and that certain types of communications should then
always be one way.
In this case it is about block layer vs. file system layer.
But you make certain assumptions about the nature of a block device to
begin with.
A block device is defined by its access method (i.e. data organized in
blocks) rather than its contiguousness or having an unchanging, "single
block" address or access space. I know this goes pretty far but it is the
truth.
In theory there is nothing against a hypothetical block device offering
ranges of blocks to a higher level (that might never change) or to be
dynamically notified of changes to that address pool.
To a process virtual memory is a space that is transparent to it whether
that space is constructed of paged memory (swap file) or not. At the same
time it is not impossible to imagine that an IO scheduler for swap would
take heed of values given by applications, such as using nice or ionice
values. That would be one way communication though.
In general a higher level should be oblivious to what kind of lower level
layer it is running on, you are right. Yet if all lower levels exhibit the
same kind of features, this point becomes moot, because at that point the
higher level will not be able to know, once more, precisely what kind of
layer it is running on, although it would have more information.
So just theoretically speaking the only thing that is required to be
consistent is the API or whatever interface you design for it.
I think there are many cases where some software can run on some libraries
but not on others because those other libraries do not offer the full
feature set of whatever standard is being defined there. An example is
DLNA/UPNP, these are not layers but the standard is ill-defined and the
device you are communicating with might not support the full set.
Perhaps these are detrimental issues but there are plenty of cases where
one type of "lower level" will suffice but another won't, think maybe of
graphics drivers. Across the layer boundary, communication is two-way
anyway. The block device *does* supply endless streams of data to the
higher layer. The only thing that would change is that you would no longer
have this "always one contigious block of blocks" but something that is
slightly more volatile.
When you "mkfs", the tool reads the size of the block device. Perhaps
subsequently the filesystem is unaware and depends on fixed values.
The feature I described (use case) would allow the set of blocks that is
available, to dynamically change. You are right that this would apparently
be a big departure from the current model.
So I'm not saying it is easy, perfect, or well understood. I'm just saying
I like the idea.
I don't know what other applications it might have but it depends entirely
on correct "discard" behaviour from the filesystem.
The filesystem should be unaware of its underlying device but discard is
never required for rotating disks as far as I can tell. This is an option
that assumes knowledge of the underlying device. From discard we can
basically infer that either we are dealing with a flash device or
something that has some smartness about what blocks it retains and what
not (think cache).
So in general this is already a change that reflects changing conditions
of block devices in general or its availability. And its characteristic
behaviour or demands from filesystems.
These are block devices that want more information to operate (well).
Coincidentally, discard also favours or enhances (possibly) lvmcache.
So it's not about doing something wildly strange here, it's about offering
a feature set that a filesystem may or may not use, or a block device may
or may not offer.
Contrary to what you say, there is nothing inherently bad about the idea.
The OS design principle violation you speak of is principle, not practical
reality. It's not that it can't be done. It's that you don't want it to
happen because it violates your principles. It's not that it wouldn't
work. It's that you don't like it to work because it violates your
principles.
At the same time I object to the notion of the system administrator being
this theoretical vastly differing role/person than the user/client.
We have no in-betweens on Linux. For fun you should do a search of your
filesystem with find -xdev based on the contents of /etc/passwd or
/etc/group. You will find that 99% of files are owned by root and the only
ones that aren't are usually user files in the home directory or specific
services in /var/lib.
Here is a script that would do it for groups (with the variables quoted and
a newline added to the printf format):

while IFS=: read -r g _; do
    printf '%-15s %6d\n' "$g" "$(find / -xdev -type f -group "$g" | wc -l)"
done < /etc/group

Probably. I can't run it here; it might crash my system (live dvd).
Of about 170k files on an OpenSUSE system, 15 were group writable, mostly
due to my own interference probably. Of 170197 files (no xdev) 168161 were
owned by root.
Excluding man and my user, 69 files did not have "root" as the group. Part
of that was again due to my own changes.
At the same time in some debates you are presented with the ludicrous
notion that there is some ideal desktop user who doesn't need to ever see
anything of the internal system. She never opens a shell and certainly
does not come across ethernet device names (for example). The "desktop
user" does not care about the naming of devices from /dev/eth0 to
/sys/class/net/enp3s0.
The desktop user never uses anything other than DHCP, etc. etc. etc.
The desktop user never can configure anything without the help of the
admin, if it is slightly more advanced.
It's that user vs. admin dichotomy that is never true on any desktop
system and I will venture it is not even true on the systems I am a client
of, because you often need to debate stuff with the vendor or ask for
features, offer solutions, etc.
In a store you are a client. There are employees and clients, nothing
else. At the same time I treat these girls as my neighbours because they
work in the block I live in.
You get the idea. Roles can be shifty. A person can have multiple roles at
the same time. He/she can be admin and user simultaneously.
Perhaps you are correct to state that the roles themselves should not be
watered down, that clear delimitations are required.
In your other email you allude to me not ever having done an OS design
course.
Offlist a friendly member suggested strongly I not use personal attacks in
my communications here. But of course this is precisely what you are doing
here, because as a matter of fact I did follow such a course.
I don't remember the book we used because apparently between my housemate
and me we only had one copy and he ended up getting it because I was
usually the one borrowing stuff from him.
At the same time university is way beyond my current reach (in living
conditions) so it is just an unwarranted allusion that does not have
anything to do with anything really.
Yes I think it was the dinosaur book:
Operating System Concepts by Silberschatz, Galvin and Gagne
Anyway, irrelevant here.
> Another way (haven't tested) to 'signal' the FS as to the true state of
> the underlying storage is to have a sparse file that gets shrunk over
> time.
You do realize you are trying to find ways around the limitation you just
imposed on yourself right?
> The system admin decided it was a bright idea to use thin pools in the
> first place so he necessarily signed up to be liable for the hazards and
> risks that choice entails. It is not the job of the FS to bail his ass
> out.
I don't think thin pools are that risky or should be that risky. They do
incur a management overhead compared to static filesystems because of
adding that second layer you need to monitor. At the same time the burden
of that can be lessened with tools.
As it stands I consider thin LVM the only reasonable way to snapshot a
running system without dedicating specific space to it in advance. I would
expect snapshotting to require stuff to be in the same volume group.
Without LVM thin, snapshotting requires making at least some prior
investment in having a snapshot device ready for you in the same VG,
right?
Do not think btrfs and ZFS are without costs. You wrote:
> Then you want an integrated block+fs implementation. See BTRFS and ZFS.
WAFL and friends.
But btrfs is not without complexity. It uses subvolumes that differ from
distribution to distribution as each makes its own choice. It requires
knowledge of more complicated tools and mechanics to do the simplest (or
most meaningful) of tasks. Working with LVM is easier. I'm not saying LVM
is perfect and....
Using snapshotting as a backup measure is something that seems risky to me
in the first place, because it is a "partition table" operation which
really you shouldn't be doing on a regular basis. So in order to use it
effectively you require tools that handle the safeguards for you. Tools
that make sure you are not making some command line mistake. Tools that
simply guard against misuse.
Regular users are not fit for being btrfs admins either.
It is going to confuse the hell out of people if that is what their systems
run on and they are introduced to some of the complexity of it.
You say swallow your pride. It has not much to do with pride.
It has to do with ending up in a situation I don't like. That is then
going to "hurt" me for the remainder of my days until I switch back or get
rid of it.
I have seen NOTHING NOTHING NOTHING inspiring about btrfs.
Not having partition tables and sending volumes across space and time to
other systems, is not really my cup of tea.
It is a vendor lock-in system and would result in other technologies being
lesser developed.
I am not alone in this opinion either.
Btrfs feels like a form of illness to me. It is living in a forest with
all deformed trees, instead of something lush and inspiring. If you've
ever played World of Warcraft, the only thing that comes a bit close is
the Felwood area ;-).
But I don't consider it beyond Plaguelands either.
Anyway.
I have felt like btrfs in my life. They have not been the happiest moments
of my life ;-).
I will respond more in another mail, this is getting too long.
* Re: [linux-lvm] thin handling of available space
[not found] <929635034.3140318.1461840230292.JavaMail.yahoo.ref@mail.yahoo.com>
@ 2016-04-28 10:43 ` matthew patton
2016-04-28 18:20 ` Xen
2016-04-28 18:25 ` Xen
0 siblings, 2 replies; 29+ messages in thread
From: matthew patton @ 2016-04-28 10:43 UTC (permalink / raw)
To: Xen, Marek Podmaka, LVM general discussion and development
> > The real question you should be asking is if it increases the monitoring
> > aspect (enhances it) if thin pool data is seen through the lens of the
> > filesystems as well.
Then you want an integrated block+fs implementation. See BTRFS and ZFS. WAFL and friends.
> kernel for communication from lower fs layers to higher layers -
Correct. Because doing so violates the fundamental precepts of OS design. Higher layers trust lower layers. Thin pools are outright lying about the real world to anything that uses their services. That is their purpose. The FS doesn't give a damn that the block layer is lying to it; it can and does assume, and rightly so, that what the block layer says it has, it indeed does have. The onus of keeping the block layer ahead of the FS falls on a third party - the system admin. The system admin decided it was a bright idea to use thin pools in the first place, so he necessarily signed up to be liable for the hazards and risks that choice entails. It is not the job of the FS to bail his ass out.
A responsible sysadmin who chose to use thin pools might configure the initial FS size to be some modest size well within the constraints of the actual block store, and then as the FS hit say 85% utilization to run a script that investigated the state of the block layer and use resize2fs and friends to grow the FS and let the thin-pool likewise grow to fit as IO gets issued. But at some point when the competing demands of other FS on thin-pool were set to breach actual block availability the FS growth would be denied and thus userland would get signaled by the FS layer that it's out of space when it hit 100% util.
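Such a script could look something like the following sketch - the VG/LV
names, mount point, and step size are assumptions, and on a real system the
grow_fs function would be wired into cron and run as root:

```shell
#!/bin/sh
# Sketch of the grow-at-85% policy described above.  VG/LV names and the
# growth step are assumptions; wire grow_fs into cron on a real system.
set -eu

VG=vg0; LV=data; MNT=/srv/data
THRESHOLD=85              # grow once the FS is this % full
STEP=10G                  # how much to add per step

needs_growth() {          # pure policy check: usage% >= threshold%?
    [ "$1" -ge "$2" ]
}

grow_fs() {               # needs root and a real thin LV at $VG/$LV
    used=$(df --output=pcent "$MNT" | tail -n1 | tr -dc '0-9')
    if needs_growth "$used" "$THRESHOLD"; then
        # lvextend refuses when the VG (the *actual* block store) is out
        # of extents, so userland eventually sees plain ENOSPC at 100%.
        lvextend -L "+$STEP" "$VG/$LV"
        resize2fs "/dev/$VG/$LV"
    fi
}

needs_growth 90 "$THRESHOLD" && echo "would grow"   # demo of the policy check
```

The point of keeping needs_growth separate is that the refusal-to-grow
decision, not the resize itself, is what signals userland.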
Another way (haven't tested) to 'signal' the FS as to the true state of the underlying storage is to have a sparse file that gets shrunk over time.
But either way, if you have a sudden burst of I/O from competing interests in the thin-pool, what appeared to be a safe growth allocation at one instant in time may no longer be true when the actual writes try to get fulfilled.
Think of mindless use of thin-pools as trying to cross a heavily mined beach: bring a long stick and say your prayers, because you're likely going to lose a limb.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [linux-lvm] thin handling of available space
2016-04-28 6:46 ` Marek Podmaka
@ 2016-04-28 10:33 ` Xen
0 siblings, 0 replies; 29+ messages in thread
From: Xen @ 2016-04-28 10:33 UTC (permalink / raw)
To: LVM general discussion and development
On Thu, 28 Apr 2016, Marek Podmaka wrote:
> Hello Xen,
>
> Wednesday, April 27, 2016, 23:28:31, you wrote:
>
>> The real question you should be asking is if it increases the monitoring
>> aspect (enhances it) if thin pool data is seen through the lens of the
>> filesystems as well.
>
>> Beside the point here perhaps. But. Let's drop the "real sysadmin"
>> ideology. We are humans. We like things to work for us. "Too easy" is
>> not a valid criticism for not having something.
>
> As far as I know (someone correct me) there is no mechanism at all in
> kernel for communication from lower fs layers to higher layers -
> besides exporting static properties like physical block size. The
> other way (from a higher layer like the fs to lower layers) works fine -
> for example, discard support.
I suspected so.
> So even if what you are asking might be valid, it isn't as simple as adding
> some parameter somewhere and it would magically work. It is about
> inventing and standardizing a new communication system, which would of
> course work only with new versions of all the tools involved.
Right.
> Anyway, I have no idea what the filesystem itself would do with the information
> that no more space is available. Also this would work only for lvm
> thin pools, not for thin provision directly from storage, so it would
> be a non-consistent mess. Or you would need another protocol for
> exporting thin-pool related dynamic data from storage (via NAS, SAN,
> iSCSI and all other protocols) to the target system. And in some
> organizations it is not desirable at all to make this kind of
> information visible to all target systems / departments.
Yes I don't know how "thin provision directly from storage" works.
I take it you mean that these protocols you mention are or would be the
channel through which the communication would need to happen that I now
just proposed for LVM.
I take it you mean that these systems offer regular looking devices over
any kind of link, while "secretly" behind the scenes using thin
provisioning for that, and that as such we are or would be dealing with
pretty "hard coded" standards that would require a lot of momentum to
change any of that. In that sense that the client of these storage systems
themselves do not know about the thin provisioning and it is up to the
admin of those systems... yadda yadda yadda.
I feel really stupid now :p.
And to make it worse, it means that in these "hardware" systems the user
and admin are separated, but the same is true if you virtualize and
offer the same model to your clients. I apologize for my noviceness here
and for the way I come across.
But I agree that it is not helpful for any client to know about hard
limits they should be oblivious to, provided that the provisioning is
done right.
It would be quite disconcerting to see your total available space suddenly
shrink without being aware of any autoextend mechanism (for instance) and
as such there seems to be a real divide between the "user" and the
"supplier" of any thin volume.
Maybe I have misinterpreted the real use case for thin pools then. But my
feeling is that I am just a bit confused at this point.
> What you are asking can be done for example directly in "df" (or you
> can make a wrapper script), which would not only check the filesystems
> themselves, but also the thin part and display the result in whatever
> format you want.
That is true of course. I have to think about it.
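Such a wrapper could cap each filesystem's reported free space at the pool's free space. A sketch; the mount names, the GB figures, and the idea of reading pool state from `lvs` are all assumptions:

```python
# Sketch of the "df wrapper" idea: per-FS numbers, but with free space
# capped at what the thin pool can actually still provide. A real
# script would take pool free space from `lvs` and per-FS figures from
# `df`; everything below (units: GB) is made up for illustration.

def thin_df(filesystems, pool_free):
    """filesystems: list of (mountpoint, total, used).
    Returns rows of (mountpoint, total, used, effective_free)."""
    rows = []
    for mount, total, used in filesystems:
        fs_free = total - used
        # A thin volume cannot really use more than the pool has left.
        rows.append((mount, total, used, min(fs_free, pool_free)))
    return rows

# Two 1024 GB thin volumes sharing a pool with only 300 GB really left:
for row in thin_df([("/srv/a", 1024, 700), ("/srv/b", 1024, 600)], 300):
    print("%-8s total=%4d used=%4d free=%4d" % row)
```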
> Also displaying real thin free space for each fs won't be "correct".
> If I see 1 TB free in each filesystem and start writing, by the
> time I finish, those 1 TB might be taken by the other fs. So
> information about current free space in thinp is useless for me, as in
> 1 minute it could be a totally different number.
But the calamity is that if that was really true, and the thing didn't
autoextend, then you'd end up with a frozen system.
So basically it seems at this point a conflict of interests:
- you don't want your clients to know your systems are failing
- they might not even be failing if they autoextend
- you don't want to scare them with in that sense, inaccurate data
- on a desktop system, the user and sysadmin would be the same
- there is not really any provision for graphical tools.
(maybe I should develop one. I so badly want to start coding again).
- a tool that notifies the user about the thin pool would equally well do
the job of informing the user/admin as a filesystem point of data, would
do.
- that implies that the two roles would stay separate.
- desktops seem to be using btrfs now in some distros
I'm concerned with the use case of a desktop user that could employ this
technique. I now understand a bit more perhaps why grub doesn't support
LVM thin.
The management tools for a desktop user also do not exist (except the
command line tools we have).
Well, wrong again: there is a GUI, it is just not very helpful.
It is not helpful at all for monitoring.
It can
* create logical volumes (regular, stripe, mirror)
* move volumes to another PV
* extend volume groups to another PV
And that's about all it can do I guess. Not sure it even needs to do much
more, but it is no monitoring tool of any sophistication.
Let me think some more on this and I apologize for the "out loud"
thinking.
Regards.
* Re: [linux-lvm] thin handling of available space
2016-04-27 21:28 ` Xen
@ 2016-04-28 6:46 ` Marek Podmaka
2016-04-28 10:33 ` Xen
0 siblings, 1 reply; 29+ messages in thread
From: Marek Podmaka @ 2016-04-28 6:46 UTC (permalink / raw)
To: Xen; +Cc: LVM general discussion and development
Hello Xen,
Wednesday, April 27, 2016, 23:28:31, you wrote:
> The real question you should be asking is if it increases the monitoring
> aspect (enhances it) if thin pool data is seen through the lens of the
> filesystems as well.
> Beside the point here perhaps. But. Let's drop the "real sysadmin"
> ideology. We are humans. We like things to work for us. "Too easy" is
> not a valid criticism for not having something.
As far as I know (someone correct me) there is no mechanism at all in
kernel for communication from lower fs layers to higher layers -
besides exporting static properties like physical block size. The
other way (from a higher layer like the fs to lower layers) works fine -
for example, discard support.
So even if what you are asking might be valid, it isn't as simple as adding
some parameter somewhere and it would magically work. It is about
inventing and standardizing a new communication system, which would of
course work only with new versions of all the tools involved.
Anyway, I have no idea what the filesystem itself would do with the information
that no more space is available. Also this would work only for lvm
thin pools, not for thin provision directly from storage, so it would
be a non-consistent mess. Or you would need another protocol for
exporting thin-pool related dynamic data from storage (via NAS, SAN,
iSCSI and all other protocols) to the target system. And in some
organizations it is not desirable at all to make this kind of
information visible to all target systems / departments.
What you are asking can be done for example directly in "df" (or you
can make a wrapper script), which would not only check the filesystems
themselves, but also the thin part and display the result in whatever
format you want.
Also displaying real thin free space for each fs won't be "correct".
If I see 1 TB free in each filesystem and start writing, by the
time I finish, those 1 TB might be taken by the other fs. So
information about current free space in thinp is useless for me, as in
1 minute it could be a totally different number.
--
bYE, Marki
* Re: [linux-lvm] thin handling of available space
2016-04-27 12:26 ` matthew patton
@ 2016-04-27 21:28 ` Xen
2016-04-28 6:46 ` Marek Podmaka
0 siblings, 1 reply; 29+ messages in thread
From: Xen @ 2016-04-27 21:28 UTC (permalink / raw)
To: LVM general discussion and development
matthew patton wrote on 27-04-2016 12:26:
> It is not the OS' responsibility to coddle stupid sysadmins. If you're
> not watching for high-water marks in FS growth vis a vis the
> underlying, you're not doing your job. If there was anything more than
> the remotest chance that the FS would grow to full size it should not
> have been thin in the first place.
Who says the only ones who would ever use or consider using thin would
be sysadmins?
Monitoring Linux is troublesome enough for most people and it really is
a "job".
You seem to be intent on making the job harder rather than easier, so
you can be the type of person that has this expert knowledge while
others don't?
I remember one reason to crack down on sysadmins was that they didn't
know how to use "vi" - if you can't use fucking vi, you're not a sysadmin.
This actually is a bloated version of what a system administrator is or
could at all times be expected to do, because you are ensuring that
problems are going to surface one way or another when this sysadmin is
suddenly no longer capable of being this perfect guy 100% of the time.
You are basically ensuring disaster by having that attitude.
That guy that can battle against all odds and still prevail ;-).
More to the point.
No one is getting cuddled because Linux is hard enough and it is usually
the users who are getting cuddled; strangely enough the attitude exists
that the average desktop user never needs to look under the hood. If
something is ugly, who cares, the "average user" doesn't go there.
The average user is oblivious to all system internals.
The system administrator knows everything and can launch a space rocket
with nothing more than matches and some gallons of rocket fuel.
;-).
The autoextend mechanism is designed to prevent calamity when the
filesystem(s) grow to full size. By your reasoning, it should not exist
because it cuddles admins.
A real admin would extend manually.
A real admin would specify the right size in advance.
A real admin would use thin pools of thin pools that expand beyond your
wildest dreams :p.
But on a more serious note, if there is no chance a file system will
grow to full size, then it doesn't need to be that big.
But there are more use cases for thin than hosting VMs for clients.
Also I believe thin pools have a use for desktop systems as well, when
you see that the only alternative really is btrfs and some distros are
going with it full-time. Btrfs also has thin provisioning in a sense but
on a different layer, which is why I don't like it.
Thin pools from my perspective are the only valid snapshotting mechanism
if you don't use btrfs or zfs or something of the kind.
Even a simple desktop monitor, some applet with configured thin pool
data, would of course alleviate a lot of the problems for a "casual
desktop user". If you remotely administer your system with VNC or the
like, that's the same. So I am saying there is no single use case for
thin.
Your response, mr. Patton, falls along the lines of "I only want this to
be used by my kind of people".
"Don't turn it into something everyone or anyone can use".
"Please let it be something special and niche".
You can read coddle in place of cuddle.
It seems pretty clear to me that a system that *requires* manual
intervention and monitoring at all times is not a good system,
particularly if the feedback on its current state cannot be retrieved
from, or is usable by, other existing systems that guard against more or
less the same type of things.
Besides, if your arguments here were valid, then
https://bugzilla.redhat.com/show_bug.cgi?id=1189215 would never have
existed.
> The FS already has a notion of 'reserved': see man(8) tune2fs, the -r option.
Alright thanks. But those blocks are manually reserved for a specific
user.
That's what they are for. It is for -u. These blocks are still available
to the filesystem.
You could call it calamity prevention as well. There will always be a
certain amount of space for say the root user.
And by the same measure you can also say the tmpfs overflow mechanism
for /tmp is not required either, because a real admin would never see
his rootfs run out of disk space.
Stuff happens. You ensure you are prepared when it does. Not stick your
head in the sand and claim that real gurus never encounter those
situations.
The real question you should be asking is if it increases the monitoring
aspect (enhances it) if thin pool data is seen through the lens of the
filesystems as well.
Or whether that is going to be a detriment.
Regards.
Erratum:
https://utcc.utoronto.ca/~cks/space/blog/tech/SocialProblemsMatter
There is a widespread attitude among computer people that it is a great
pity that their beautiful solutions to difficult technical challenges
are being prevented from working merely by some pesky social issues
[read: human flaws], and that the problem is solved once the technical
work is done. This attitude misses the point, especially in system
administration: broadly speaking, the technical challenges are the easy
problems.
No technical system is good if people can't use it or if it makes
people's lives harder (my words). One good example of course is Git. The
typical attitude you get is that a real programmer has all the skills of
a git guru. Yet git is a git. Git is an asshole system.
Beside the point here perhaps. But. Let's drop the "real sysadmin"
ideology. We are humans. We like things to work for us. "Too easy" is
not a valid criticism for not having something.
* Re: [linux-lvm] thin handling of available space
[not found] <518072682.2617983.1461760017772.JavaMail.yahoo.ref@mail.yahoo.com>
@ 2016-04-27 12:26 ` matthew patton
2016-04-27 21:28 ` Xen
0 siblings, 1 reply; 29+ messages in thread
From: matthew patton @ 2016-04-27 12:26 UTC (permalink / raw)
To: LVM general discussion and development
It is not the OS's responsibility to coddle stupid sysadmins. If you're not watching for high-water marks in FS growth vis-à-vis the underlying storage, you're not doing your job. If there was anything more than the remotest chance that the FS would grow to full size, it should not have been thin in the first place.
The FS already has a notion of 'reserved': see man(8) tune2fs, the -r option.
* Re: [linux-lvm] thin handling of available space
2016-04-23 17:53 Xen
@ 2016-04-27 12:01 ` Xen
0 siblings, 0 replies; 29+ messages in thread
From: Xen @ 2016-04-27 12:01 UTC (permalink / raw)
To: LVM general discussion and development
I was talking about the idea of communicating to a filesystem the
amount of available blocks.
I noticed https://bugzilla.redhat.com/show_bug.cgi?id=1189215 named "LVM
Thin: Handle out of space conditions better" which was resolved by
Zdenek Kabelac (hey Zdenek) and which gave rise to (apparently) the new
warning you get when you overprovision.
But this warning when overprovisioning does not solve any problems in a
running system.
You /still/ want to overprovision AND you want a better way to handle
out of space conditions.
A number of items were suggested in that bug:
1) change the default "resize thin-p at 100%" setting in lvm.conf
2) warn users that they have insufficient space in a pool to cover a
fully used thinLV
3) change default wait time from 60sec after an out-of-space condition
to something longer
Corey Marthaler suggested that only #2 was implemented, and this bug (as
mentioned) was linked in an erratum at the end of the bug.
So, since I have already talked about it here with my lengthy rambling
post ;-) I would like to here at least "formally" suggest a #4 and ask
whether I should comment on that bug or submit a new one about it.
So my #4 would be:
4) communicate and dynamically update a list of free blocks being sent
to the filesystem layer on top of a logical volume (LV) such that the
filesystem itself is aware of shrinking free space.
Logic implies:
- any thin LV seeing more blocks being used causes the other filesystems
in that thin pool to be updated with new available blocks (or numbers)
if this amount becomes less than what the filesystem would normally
think it had
- any thin LV that sees blocks being discarded by the filesystem causes
the other filesystems in that thin pool to be updated with newly
available blocks (or numbers) up to the moment that the real available
space agrees once more with the virtual available space (real free >=
virtual free)
Meaning that this feedback would start happening for any thin LV when
the real available space in the thin pool or volume group (depending on
how that works at that point in that place in that configuration)
becomes less than the virtual available space for the thin volume (LV).
This would mean that the virtual available space would in effect
dynamically shrink and grow with the real available space as an
envelope.
The filesystem may know this as an adjusted total available space
(number of blocks) or as an adjusted number of unavailable blocks. It
would need to integrate this in its free space calculation. For a user
tool such as "df" there are 3 ways to update this changing information:
1. dynamically adjust the total available blocks
2. dynamically adjust the amount of free blocks
3. introduce a new field of "unavailable"
Traditional "df" is "total = used + free", the new one would be "total =
used + free + unavailable".
For any user tool not working in blocks but simply in available space
(bytes), likely only the amount of free space being reported would
change.
One may choose to hide the information in "df" and introduce a new flag
that shows unavailable as well. Then only the amount of free blocks
reported would change, and the numbers just wouldn't add up visibly.
It falls along the line of the "discard" family of communications that
were introduced in 2008 (https://lwn.net/Articles/293658/).
I DO NOT KNOW if this already exists but I suppose it doesn't. I do not
know a lot about the filesystem layer. I just took the liberty of asking
Jonathan Corwell erm Corbet whether this is possible :p.
Anyway, hopefully I am not being too much of a pain here. Regards.
* [linux-lvm] thin handling of available space
@ 2016-04-23 17:53 Xen
2016-04-27 12:01 ` Xen
0 siblings, 1 reply; 29+ messages in thread
From: Xen @ 2016-04-23 17:53 UTC (permalink / raw)
To: Linux lvm
Hi,
So here is my question. I was talking about it with someone, who also
didn't know.
There seems to be a reason against creating a combined V-size that
exceeds the total L-size of the thin-pool. I mean that's amazing if you
want extra space to create more volumes at will, but at the same time
having a larger sum V-size is also an important use case.
Is there any way that user tools could ever be allowed to know about the
real effective free space on these volumes?
My thinking goes like this:
- if LVM knows about allocated blocks then it should also be aware of
blocks that have been freed.
- so it needs to receive some communication from the filesystem
- that means the filesystem really maintains a "claim" on used blocks,
or at least notifies the underlying layer of its mutations.
- in that case a reverse communication could also exist where the block
device communicates to the file system about the availability of
individual blocks (such as might happen with bad sectors) or even the
total amount of free blocks. That means the disk/volume manager (driver)
could or would maintain a mapping or table of its own blocks. Something
that needs to be persistent.
That means the question becomes this:
- is it either possible (theoretically) that LVM communicates to the
filesystem about the real number of free blocks that could be used by
the filesystem to make "educated decisions" about the real availability
of data/space?
- or, is it possible (theoretically) that LVM communicates a "crafted"
map of available blocks in which a certain (algorithmically determined)
group of blocks would be considered "unavailable" due to actual real
space restrictions in the thin pool? This would seem very suboptimal but
would have the same effect.
Say the filesystem thinks it has 6GB available but really there is only
3GB because data is filling up - does it currently get notified of
this?
What happens if it does fill up?
Funny that we are using GB in this example. I remembered today using
Stacker on MS-DOS disk where I had 20MB available and was able to
increase it to 30MB ;-).
Someone else might use terabytes, but anyway.
If the filesystem normally has a fixed size and this size doesn't change
after creation (without modifying the filesystem) then it is going to
calculate its free space based on its knowledge of available blocks.
So there are three figures:
- total available space
- real available space
- data taken up by files.
total - data is not always real, because there may still be open
handles on deleted files, etc. Visible, countable files (their "du") +
blocks still in use + available blocks should be ~ total blocks.
So we are only talking about blocks here, nothing else.
And if LVM can communicate about availability of blocks, a fourth figure
comes into play:
total = used blocks + unused blocks + unavailable blocks.
If LVM were able to dynamically adjust this last figure, we might have a
filesystem that truthfully reports actual available space. In a thin
setting.
I do not even know whether this is not already the case, but I read
something that indicated an importance of "monitoring available space"
which would make the whole situation unusable for an ordinary user.
Then you would need GUI applets that said "The space on your thin volume
is running out (but the filesystem might not report it)".
So question is:
* is this currently 'provisioned' for?
* is this theoretically possible, if not?
If you take it to a tool such as "df", there are only three figures and
they add up.
They are:
total = used + available
but we want
total = used + available + unavailable
either that, or the total must be dynamically adjusted, but I think
this is not a good solution.
So another question:
*SHOULDN'T THIS simply be a feature of any filesystem?*
The provision of being able to know about the *real* number of blocks in
case an underlying block device might not be "fixed, stable, and
unchanging"?
The way it is you can "tell" Linux filesystems with fsck which blocks
are bad blocks and thus unavailable, probably reducing the number of
"total" blocks.
From a user interface perspective, perhaps this would be an ideal
solution, if you needed any solution at all. Personally I would probably
prefer either the total space to be "hard limited" by the underlying
(LVM) system, or for df to show a different output, but df output is
often parsed by scripts.
In the former case supposing a volume was filling up.
udev 1974288 0 1974288 0% /dev
tmpfs 404384 41920 362464 11% /run
/dev/sr2 1485120 1485120 0 100% /cdrom
(Just taking 3 random filesystems)
One filesystem would see "used" space go up. The other two would see
"total" size going down - and the first one, in addition, would also
see that figure go down. That would be counterintuitive and you cannot
really do this.
It's impossible to give this information to the user in a way that the
numbers still add up.
Supposing:
real size 2000
1000 500 500
1000 500 500
1000 500 500
combined virtual size 3000. Total usage 1500. Real free 500. Now the
first volume uses another 250.
1000 750 250
1000 500 250
1000 500 250
The numbers no longer add up for the 2nd and 3rd system.
You *can* adjust total in a way that it still makes sense (a bit)
1000 750 250
750 500 250
750 500 250
You can also just ignore the discrepancy, or add another figure:
total used unav avail
1000 750 0 250
1000 500 250 250
1000 500 250 250
Whatever you do, you would have to simply calculate this adjusted number
from the real number of available blocks.
Now the third volume takes another 100
First style:
1000 750 150
1000 500 150
1000 600 150
Second style:
1000 750 150
650 500 150
750 600 150
Third style:
total used unav avail
1000 750 100 150
1000 500 350 150
1000 600 250 150
There's nothing technically inconsistent about it, it is just rather
difficult to grasp at first glance.
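The rule behind these tables can be stated in one line: each volume's available space is its virtual free space capped by the pool's real free space, and "unav" is whatever got capped off. A sketch (the function name is mine) that reproduces the third-style numbers above:

```python
def third_style(volumes, real_size):
    """volumes: list of (virtual_size, used) pairs sharing one pool.
    Returns (total, used, unav, avail) per volume, as in the tables."""
    pool_free = real_size - sum(used for _, used in volumes)
    rows = []
    for size, used in volumes:
        virtual_free = size - used
        avail = min(virtual_free, pool_free)   # capped by the pool
        rows.append((size, used, virtual_free - avail, avail))
    return rows

# Real size 2000; the first volume has grown to 750 used:
for row in third_style([(1000, 750), (1000, 500), (1000, 500)], 2000):
    print("%5d %5d %5d %5d" % row)
# Then the third volume takes another 100:
for row in third_style([(1000, 750), (1000, 500), (1000, 600)], 2000):
    print("%5d %5d %5d %5d" % row)
```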
df uses filesystem data, but we are really talking about
block-layer-level-data now.
You would either need to communicate the number of available blocks (but
which ones?) and let the filesystem calculate unavailable --- or
communicate the number of unavailable blocks at which point you just do
this calculation yourself. For each volume you reach a different number
of "blocks" you need to withhold.
If you needed to make those blocks unavailable, you would now randomly
(or at the end of the volume, or any other method) need to "unavail"
those to the filesystem layer beneath (or above).
Every write that filled up more blocks would be communicated to you,
(since you receive the write or the allocation) and would result in an
immediate return of "spurious" mutations or an updated number of
unavailable blocks -- and you can also communicate both.
On every new allocation, the filesystem would be returned blocks that
you have "fakely" marked as unavailable. All of this only happens if
available real space becomes less than that of the individual volumes
(virtual size). The virtual "available" minus the "real available" is
the number of blocks (extents) you are going to communicate as being
"not there".
At every mutation from the filesystem, you respond with a like mutation:
not to the filesystem that did the mutation, but to every other
filesystem on every other volume.
Space being freed (deallocated) then means a reverse communication to
all those other filesystems/volumes.
But it would work, if this was possible. This is the entire algorithm.
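As a toy model of that algorithm (the class and its methods are mine, not an existing LVM interface), allocations and discards on one volume drive "unavailable" updates for all the others:

```python
class ThinPoolModel:
    """Toy model of the propagation above; not a real LVM interface.
    Every allocation or discard re-derives the number of blocks each
    *other* volume must be told is unavailable."""

    def __init__(self, real_size, virtual_sizes):
        self.real_size = real_size
        self.virtual = list(virtual_sizes)
        self.used = [0] * len(virtual_sizes)

    def pool_free(self):
        return self.real_size - sum(self.used)

    def unavailable(self, i):
        """Envelope rule: a volume's virtual free space beyond the
        pool's real free space is what it must treat as gone."""
        virtual_free = self.virtual[i] - self.used[i]
        return max(0, virtual_free - self.pool_free())

    def allocate(self, writer, blocks):
        self.used[writer] += blocks
        # The writer's own write just succeeds; every *other* volume
        # would now be notified of its new unavailable() figure.
        return [self.unavailable(j) for j in range(len(self.used))
                if j != writer]

    def discard(self, freer, blocks):
        self.used[freer] -= blocks
        # The reverse communication on frees.
        return [self.unavailable(j) for j in range(len(self.used))
                if j != freer]
```

With a 2000-block pool and three 1000-block volumes at 500 used each, a further 250-block allocation on the first volume leaves the pool with 250 free, so the other two volumes are each told that 250 of their 500 virtually-free blocks are now unavailable; discarding those 250 blocks again drops that figure back to 0.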
I'm sorry if this sounds like a lot of "talk" and very little "doing"
and I am annoyed by that as well. Sorry about that. I wish I could
actually be active with any of these things.
I am reminded of my father. He was in school for being a car mechanic
but he had a scooter accident days before having to do his exam. They
did the exam with him in a (hospital) bed. He only needed to give
directions on what needed to be done and someone else did it for him :p.
That's how he passed his exam. It feels the same way for me.
Regards.
end of thread, other threads:[~2016-05-04 18:16 UTC | newest]
Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <1684768750.3193600.1461851163510.JavaMail.yahoo.ref@mail.yahoo.com>
2016-04-28 13:46 ` [linux-lvm] thin handling of available space matthew patton
[not found] <799090122.6079306.1462373733693.JavaMail.yahoo.ref@mail.yahoo.com>
2016-05-04 14:55 ` matthew patton
2016-05-03 18:19 Xen
[not found] <1614984310.1700582.1462280490763.JavaMail.yahoo.ref@mail.yahoo.com>
2016-05-03 13:01 ` matthew patton
2016-05-03 15:47 ` Xen
2016-05-04 0:56 ` Mark Mielke
[not found] <1870050920.5354287.1462276845385.JavaMail.yahoo.ref@mail.yahoo.com>
2016-05-03 12:00 ` matthew patton
2016-05-03 14:38 ` Xen
2016-05-04 1:25 ` Mark Mielke
2016-05-04 18:16 ` Xen
[not found] <929635034.3140318.1461840230292.JavaMail.yahoo.ref@mail.yahoo.com>
2016-04-28 10:43 ` matthew patton
2016-04-28 18:20 ` Xen
2016-04-28 18:25 ` Xen
2016-04-29 11:23 ` Zdenek Kabelac
2016-05-02 14:32 ` Mark Mielke
2016-05-03 9:45 ` Zdenek Kabelac
2016-05-03 10:41 ` Mark Mielke
2016-05-03 11:18 ` Zdenek Kabelac
2016-05-03 10:15 ` Gionatan Danti
2016-05-03 11:42 ` Zdenek Kabelac
2016-05-03 13:15 ` Gionatan Danti
2016-05-03 15:45 ` Zdenek Kabelac
2016-05-03 12:42 ` Xen
[not found] <518072682.2617983.1461760017772.JavaMail.yahoo.ref@mail.yahoo.com>
2016-04-27 12:26 ` matthew patton
2016-04-27 21:28 ` Xen
2016-04-28 6:46 ` Marek Podmaka
2016-04-28 10:33 ` Xen
-- strict thread matches above, loose matches on Subject: below --
2016-04-23 17:53 Xen
2016-04-27 12:01 ` Xen